Network Working Group                                        M. Blanchet
Internet-Draft                                                  Viagenie
Obsoletes: 3454 (if approved)                               July 5, 2010
Intended status: Standards Track
Expires: January 6, 2011


   Precis Framework: Handling Internationalized Strings in Protocols
                 draft-blanchet-precis-framework-00.txt

Abstract

   Using Unicode codepoints in protocol strings requires preparation of
   the string.  This document describes the Precis Protocol Framework
   that prepares various classes of strings used in protocol elements.
   A protocol specification chooses a class of strings and then
   implements the corresponding preparation steps described in this
   document.  This document is based on the IDNAbis approach.  It
   obsoletes the Stringprep algorithm.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 6, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must


Blanchet                 Expires January 6, 2011                [Page 1]

Internet-Draft              Precis Framework                   July 2010


   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.


Table of Contents

   1.  Introduction
   2.  String Classes
   3.  Domain U-Label, A-Label and Name
   4.  Email Addresses
   5.  Restricted Identifier
   6.  Less-Restrictive Identifier
   7.  Normalization Form and Case Folding
   8.  Codepoint Properties
   9.  Category definitions Used to Calculate Derived Property Value
     9.1.  LetterDigits (A)
     9.2.  Unstable (B)
     9.3.  IgnorableProperties (C)
     9.4.  IgnorableBlocks (D)
     9.5.  LDH (E)
     9.6.  Exceptions (F)
     9.7.  BackwardCompatible (G)
     9.8.  JoinControl (H)
     9.9.  OldHangulJamo (I)
     9.10. Unassigned (J)
   10. Calculation of the Derived Property
   11. Codepoints
   12. IANA Considerations
     12.1. IDNA derived property value registry
     12.2. IDNA Context Registry
       12.2.1.  Template for context registry
   13. Security Considerations
   14. Discussion home for this draft
   15. Acknowledgements
   Appendix A.   Contextual Rules Registry
   Appendix A.1. ZERO WIDTH NON-JOINER
   Appendix A.2. ZERO WIDTH JOINER
   Appendix A.3. MIDDLE DOT
   Appendix A.4. GREEK LOWER NUMERAL SIGN (KERAIA)
   Appendix A.5. HEBREW PUNCTUATION GERESH
   Appendix A.6. HEBREW PUNCTUATION GERSHAYIM
   Appendix A.7. KATAKANA MIDDLE DOT
   Appendix A.8. ARABIC-INDIC DIGITS
   Appendix A.9. EXTENDED ARABIC-INDIC DIGITS
   Appendix B.   Codepoints 0x0000 - 0x10FFFF
   Appendix B.1. Codepoints in Unicode Character Database (UCD) format
   16. Informative References
   Author's Address


1.  Introduction

   [draft-ietf-blanchet-newprep-problem-statement] describes the
   rationale behind updating Stringprep [RFC3454] to a new framework.

   Current Stringprep profiles and their corresponding protocol
   specifications share similar classes of strings.  This framework is
   based on the assumption that the use of internationalized strings in
   most protocols can be grouped into a small set of string classes.  By
   defining a few string classes and their corresponding preparation
   algorithms instead of specific profiles for each protocol,
   o  protocol specifications do not need to have a special i18n
      section or implementation, since they would reference one of this
      document's string classes and the corresponding processing.
   o  protocols benefit from sharing implementation code and tables.
   o  end-users will have a better knowledge of which codepoints are
      allowed in various contexts (instead of a specific profile per
      protocol, as with Stringprep profiles).
   o  versioning for future versions of the Unicode database is simpler.
   o  protocols that are similar to others (such as username
      identifiers used in various authentication schemes in protocols)
      can use the same string class, and therefore obtain consistency
      for end-users and implementors.

   This framework draws heavily on the IDNAbis tables [IDNABISTABLES]
   and could therefore help implementors by sharing common code for all
   string classes, including domain labels and names.

   EDITOR NOTE: The current version of this document copies a lot of
   normative text from draft-ietf-idnabis-tables.  The editor would
   highly prefer referencing over copying, but the text is copied here,
   at least for the purpose of discussion.  Moreover, the idnabis-tables
   draft contains references to IDN labels in many places, which may
   make a normative reference problematic.  To be looked at as we go.


2.  String Classes

   The following classes of strings are identified:
   o  domain U-label
   o  domain A-label
   o  domain name
   o  email address
   o  restricted identifier
   o  less-restrictive identifier


3.  Domain U-Label, A-Label and Name

   TBD: define the class.

   For these string classes, implement [IDNA2008].


4.  Email Addresses

   TBD: define the class by instantiating and referring to the EAI,
   SMTP.

   For this class of strings, implement [EAI]?


5.  Restricted Identifier

   This class of strings, named RI in this document, corresponds to an
   identifier which contains language-type characters, no spacing
   characters, no "@", no "punctuation", no display characters.  The
   normative description of this class is in the corresponding mapping
   tables.

   In section XX below, allowed Unicode codepoints for this string class
   are identified as PVALID or RI_PVALID.  Disallowed codepoints are
   identified as DISALLOWED or RI_DISALLOWED.


6.  Less-Restrictive Identifier

   This class of strings, named LRI in this document, corresponds to an
   identifier which contains language-type characters, no spacing
   characters, no "@", but contains various "punctuation" and display
   characters.  The normative description of this class is in the
   corresponding mapping tables.

   In section XX below, allowed Unicode codepoints for this string class
   are identified as PVALID or LRI_PVALID.  Disallowed codepoints are
   identified as DISALLOWED or LRI_DISALLOWED.


7.  Normalization Form and Case Folding

   TBD: discuss NFC vs NFKC, case folding.


8.  Codepoint Properties

   This document reviews and classifies the collections of code points
   in the Unicode character set by examining various properties of the
   code points.  It then defines an algorithm for determining a derived
   property value.  It specifies a procedure, and not a table, of code
   points so that the algorithm can be used to determine code point sets
   independent of the version of Unicode that is in use.

   This document is not intended to specify precisely how these property
   values are to be applied in protocol strings.  That information
   should be defined in the protocol specification that instantiates a
   string class of this document.

   The value of the property is to be interpreted as follows.

   o  PROTOCOL VALID: Those that are allowed to be used in any string
      class.  Code points with this property value are permitted for
      general use in any string class.  The abbreviated term PVALID is
      used to refer to this value in the rest of this document.
   o  SPECIFIC CLASS PROTOCOL VALID: Those that are allowed to be used
      in specific string classes.  Code points with this property value
      are permitted for use in specific string classes.  The abbreviated
      term *_PVALID, where * = (RI, LRI) is used to refer to this value
      in the rest of this document.
   o  CONTEXTUAL RULE REQUIRED: Some characteristics of the character,
      such as it being invisible in certain contexts or problematic in
      others, require that it not be used in labels unless specific
      other characters or properties are present.  The abbreviated term
      CONTEXT is used to refer to this value in the rest of this
      document.  There are two subdivisions of CONTEXTUAL RULE REQUIRED,
      one for Join_Control characters (called CONTEXTJ) and one for
      other characters (called CONTEXTO).
   o  DISALLOWED: Those that should clearly not be included in any
      string class.  Code points with this property value are not
      permitted in any string class.
   o  SPECIFIC CLASS DISALLOWED: Those that should clearly not be
      included in specific string classes.  Code points with this
      property value are not permitted in the specific string classes
      concerned.  The abbreviated term *_DISALLOWED, where * = (RI,
      LRI) is used to refer to this value in the rest of this document.
   o  UNASSIGNED: Those code points that are not designated (i.e. are
      unassigned) in the Unicode Standard.

   The mechanisms described here allow determination of the value of the
   property for future versions of Unicode (including characters added
   after Unicode 5.2).  Changes in Unicode properties that do not affect
   the outcome of this process do not affect this framework.  For
   example, a character can have its Unicode General_Category value (see
   [Unicode52]) change from So to Sm, or from Lo to Ll, without
   affecting the algorithm results.  Moreover, even if such changes were
   to result, the BackwardCompatible list (Section 9.7) can be adjusted
   to ensure the stability of the results.

   Some code points need to be allowed in exceptional circumstances, but
   should be excluded in all other cases; these rules are also described
   in other documents.  The most notable of these are the Join Control
   characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON-
   JOINER.  Both of them have the derived property value CONTEXTJ.  A
   character with the derived property value CONTEXTJ or CONTEXTO
   (CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate
   rule has been established and the context of the character is
   consistent with that rule.  It is invalid to either register a string
   containing these characters or even to look one up unless such a
   contextual rule is found and satisfied.  Please see Appendix A, The
   Contextual Rules Registry, for more information.


9.  Category definitions Used to Calculate Derived Property Value

   The derived property obtains its value based on a two-step procedure.
   First, characters are placed in one or more character categories
   based on either core properties defined by the Unicode Standard or by
   treating the codepoint as an exception and addressing the codepoint
   by its codepoint value.  These categories are not mutually exclusive.

   In the second step, set operations are used with these categories to
   determine the values for a string-class-specific property.  Those
   operations are specified in Section 10.

   Unicode property names and property value names may have short
   abbreviations, such as gc for the General_Category property, and Ll
   for the Lowercase_Letter property value of the gc property.

   In the following specification of categories, the operation which
   returns the value of a particular Unicode character property for a
   code point is designated by using the formal name of that property
   (from PropertyAliases.txt) followed by '(cp)'.  For example, the
   value of the General_Category property for a code point is indicated
   by General_Category(cp).

9.1.  LetterDigits (A)

   A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}

   This rule identifies characters commonly used in mnemonics and
   often informally described as "language characters".

   For more information, see section 4.5 of [Unicode5].

   The categories used in this rule are:
   o  Ll - Lowercase_Letter
   o  Lu - Uppercase_Letter
   o  Lo - Other_Letter
   o  Nd - Decimal_Number
   o  Lm - Modifier_Letter
   o  Mn - Nonspacing_Mark
   o  Mc - Spacing_Mark
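
   As a non-normative illustration, the LetterDigits test can be
   sketched in Python with the standard unicodedata module.  The name
   is_letter_digits is an assumption made for this example only, and
   the result reflects the Unicode version bundled with the Python
   interpreter, which is generally newer than Unicode 5.2.

```python
import unicodedata

# General_Category values listed for category A in Section 9.1.
LETTER_DIGITS_GC = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_letter_digits(cp: int) -> bool:
    """Return True if General_Category(cp) is one of the category A values."""
    return unicodedata.category(chr(cp)) in LETTER_DIGITS_GC
```

   Membership in this category alone does not make a code point PVALID;
   the other categories are consulted first when the derived property
   is calculated.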

9.2.  Unstable (B)

   B: toNFKC(toCaseFold(toNFKC(cp))) != cp

   This category is used to group the characters that are not stable
   under NFKC normalization and casefolding.  In general, these code
   points are not suitable for use in any string class.

   The toCaseFold() operation is defined in Section 3.13 of [Unicode5].

   The toNFKC() operation returns the code point in normalization form
   KC.  For more information, see Section 5 of [TR15].
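
   The Unstable test can be sketched as follows.  This is a minimal,
   non-normative example: it assumes that Python's str.casefold() is an
   acceptable stand-in for toCaseFold(), and it uses the Unicode data
   shipped with the interpreter rather than Unicode 5.2.  The function
   name is_unstable is hypothetical.

```python
import unicodedata


def is_unstable(cp: int) -> bool:
    """Category B: toNFKC(toCaseFold(toNFKC(cp))) != cp."""
    ch = chr(cp)
    nfkc = unicodedata.normalize("NFKC", ch)
    # str.casefold() applies Unicode full case folding.
    folded = nfkc.casefold()
    return unicodedata.normalize("NFKC", folded) != ch
```

   Under this sketch, uppercase letters and compatibility characters
   such as ligatures are reported as unstable, while lowercase ASCII
   letters and digits are not.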

9.3.  IgnorableProperties (C)

   C: Default_Ignorable_Code_Point(cp) = True or
      White_Space(cp) = True or
      Noncharacter_Code_Point(cp) = True

   This category is used to group code points that are not recommended
   for use in identifiers.  In general, these code points are not
   suitable for identifiers.

   The definition for Default_Ignorable_Code_Point can be found in
   DerivedCoreProperties.txt [1] and is at the time of Unicode 5.2:

   Other_Default_Ignorable_Code_Point + Cf (Format characters)
   + Variation_Selector - White_Space - FFF9..FFFB (Annotation
   Characters) - 0600..0603, 06DD, 070F (exceptional Cf characters
   that should be visible)

9.4.  IgnorableBlocks (D)

   D: Block(cp) is in {Combining Diacritical Marks for Symbols,
                       Musical Symbols, Ancient Greek Musical Notation}

   This category is used to identify code points that are not useful
   in mnemonics but may be useful for some string classes.

   The definition of blocks can be found in Blocks.txt [2].
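
   Python's standard library does not expose the Block property, so a
   sketch of this test has to carry the block ranges itself.  The ranges
   below are those given for the three blocks in Blocks.txt; the names
   IGNORABLE_BLOCKS and is_ignorable_block are assumptions made for this
   non-normative example.

```python
# Inclusive code point ranges of the three blocks named in category D.
IGNORABLE_BLOCKS = (
    (0x20D0, 0x20FF),    # Combining Diacritical Marks for Symbols
    (0x1D100, 0x1D1FF),  # Musical Symbols
    (0x1D200, 0x1D24F),  # Ancient Greek Musical Notation
)


def is_ignorable_block(cp: int) -> bool:
    """Return True if cp lies in one of the category D blocks."""
    return any(lo <= cp <= hi for lo, hi in IGNORABLE_BLOCKS)
```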

9.5.  LDH (E)

   E: cp is in {002D, 0030..0039, 0061..007A}

   This category is used in the second step to preserve the traditional
   "hostname" (LDH) characters ('-', 0-9 and a-z).  In general, these
   code points are suitable for use in identifiers.

9.6.  Exceptions (F)

   F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
                0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
                0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
                06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 0F0B, 3007,
                302E, 302F, 3031, 3032, 3033, 3034, 3035, 303B,
                30FB}

   This category explicitly lists code points for which the category
   cannot be assigned using only the core property values that exist in
   the Unicode standard.  The values are according to the table below:

 PVALID -- Would otherwise have been DISALLOWED

 00DF; PVALID     # LATIN SMALL LETTER SHARP S
 03C2; PVALID     # GREEK SMALL LETTER FINAL SIGMA
 06FD; PVALID     # ARABIC SIGN SINDHI AMPERSAND
 06FE; PVALID     # ARABIC SIGN SINDHI POSTPOSITION MEN
 0F0B; PVALID     # TIBETAN MARK INTERSYLLABIC TSHEG
 3007; PVALID     # IDEOGRAPHIC NUMBER ZERO

 CONTEXTO -- Would otherwise have been DISALLOWED

 00B7; CONTEXTO   # MIDDLE DOT
 0375; CONTEXTO   # GREEK LOWER NUMERAL SIGN (KERAIA)
 05F3; CONTEXTO   # HEBREW PUNCTUATION GERESH
 05F4; CONTEXTO   # HEBREW PUNCTUATION GERSHAYIM
 30FB; CONTEXTO   # KATAKANA MIDDLE DOT

 CONTEXTO -- Would otherwise have been PVALID

 0660; CONTEXTO   # ARABIC-INDIC DIGIT ZERO
 0661; CONTEXTO   # ARABIC-INDIC DIGIT ONE
 0662; CONTEXTO   # ARABIC-INDIC DIGIT TWO
 0663; CONTEXTO   # ARABIC-INDIC DIGIT THREE
 0664; CONTEXTO   # ARABIC-INDIC DIGIT FOUR
 0665; CONTEXTO   # ARABIC-INDIC DIGIT FIVE
 0666; CONTEXTO   # ARABIC-INDIC DIGIT SIX
 0667; CONTEXTO   # ARABIC-INDIC DIGIT SEVEN
 0668; CONTEXTO   # ARABIC-INDIC DIGIT EIGHT
 0669; CONTEXTO   # ARABIC-INDIC DIGIT NINE
 06F0; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ZERO
 06F1; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ONE
 06F2; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT TWO
 06F3; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT THREE
 06F4; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FOUR
 06F5; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FIVE
 06F6; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SIX
 06F7; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SEVEN
 06F8; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT EIGHT
 06F9; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT NINE

 DISALLOWED -- Would otherwise have been PVALID

 0640; DISALLOWED # ARABIC TATWEEL
 07FA; DISALLOWED # NKO LAJANYALAN
 302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
 302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
 3031; DISALLOWED # VERTICAL KANA REPEAT MARK
 3032; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK
 3033; DISALLOWED # VERTICAL KANA REPEAT MARK UPPER HALF
 3034; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HA
 3035; DISALLOWED # VERTICAL KANA REPEAT MARK LOWER HALF
 303B; DISALLOWED # VERTICAL IDEOGRAPHIC ITERATION MARK

9.7.  BackwardCompatible (G)

   G: cp is in {}

   This category includes the code points whose property values in
   versions of Unicode after 5.2 have changed in such a way that the
   derived property value would no longer be PVALID or DISALLOWED.  If
   changes are made to future versions of Unicode so that code points
   might change property value from PVALID or DISALLOWED, then this
   table can be updated to keep special exception values so that the
   property values for code points stay stable.

9.8.  JoinControl (H)

   H: Join_Control(cp) = True

   This category consists of Join Control characters (i.e., they are not
   in LetterDigits (Section 9.1)) but are still required in strings
   under some circumstances.

9.9.  OldHangulJamo (I)

   I: Hangul_Syllable_Type(cp) is in {L, V, T}

   This category consists of all conjoining Hangul Jamo (Leading Jamo,
   Vowel Jamo, and Trailing Jamo).

   Elimination of conjoining Hangul Jamos from the set of PVALID
   characters results in restricting the set of Korean PVALID characters
   just to preformed, modern Hangul syllable characters.  Old Hangul
   syllables, which must be spelled with sequences of conjoining Hangul
   Jamos, are not PVALID for string classes.

9.10.  Unassigned (J)

   J: General_Category(cp) is in {Cn} and
      Noncharacter_Code_Point(cp) = False

   This category consists of code points in the Unicode character set
   that are not (yet) assigned.  It should be noted that Unicode
   distinguishes between 'unassigned code points' and 'unassigned
   characters'.  The unassigned code points are all but (Cn -
   Noncharacters), while the unassigned *characters* are all but (Cn +
   Cs).
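
   A non-normative sketch of the Unassigned test follows.  The
   Noncharacter_Code_Point property is not available from Python's
   unicodedata module, so the example encodes it directly as the range
   U+FDD0..U+FDEF plus the last two code points of every plane.  The
   function names are assumptions, and which code points count as
   unassigned depends on the Unicode version bundled with the
   interpreter.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Noncharacter_Code_Point: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """Category J: General_Category is Cn and cp is not a noncharacter."""
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)
```

   Noncharacters also have General_Category Cn, which is why the second
   condition is needed to keep them out of this category.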


10.  Calculation of the Derived Property

   Possible values of the property are:

   o  PVALID
   o  RI_PVALID
   o  LRI_PVALID
   o  CONTEXTJ
   o  CONTEXTO
   o  DISALLOWED
   o  RI_DISALLOWED
   o  LRI_DISALLOWED
   o  UNASSIGNED

   The algorithm to calculate the value of the derived property is as
   follows.  If the name of a rule (such as Exceptions) is used, that
   implies the set of codepoints that the rule defines, while the same
   name as a function call (such as Exceptions(cp)) implies the value
   that cp has in the Exceptions table.

   If .cp. .in. Exceptions Then Exceptions(cp);
   Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp);
   Else If .cp. .in. Unassigned Then UNASSIGNED;
   Else If .cp. .in. LDH Then PVALID;
   Else If .cp. .in. JoinControl Then CONTEXTJ;
   Else If .cp. .in. Unstable Then DISALLOWED;
   Else If .cp. .in. IgnorableProperties Then DISALLOWED;
   Else If .cp. .in. IgnorableBlocks Then LRI_PVALID;
   Else If .cp. .in. OldHangulJamo Then DISALLOWED;
   Else If .cp. .in. LetterDigits Then PVALID;
   Else DISALLOWED;
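
   The ordering of the rules above matters, because the categories are
   not mutually exclusive.  The following non-normative Python sketch
   shows one way to express the first-match evaluation.  The function
   derived_property, its parameters, and the toy inputs are assumptions
   made for illustration; in particular, the toy predicates are NOT the
   Section 9 definitions, and the toy exceptions table reproduces only
   three rows of the Exceptions table.

```python
from typing import Callable, Dict, List, Tuple

# Rule order taken from the pseudo-code in Section 10.
RULES: List[Tuple[str, str]] = [
    ("Unassigned", "UNASSIGNED"),
    ("LDH", "PVALID"),
    ("JoinControl", "CONTEXTJ"),
    ("Unstable", "DISALLOWED"),
    ("IgnorableProperties", "DISALLOWED"),
    ("IgnorableBlocks", "LRI_PVALID"),
    ("OldHangulJamo", "DISALLOWED"),
    ("LetterDigits", "PVALID"),
]


def derived_property(
    cp: int,
    exceptions: Dict[int, str],
    backward_compatible: Dict[int, str],
    categories: Dict[str, Callable[[int], bool]],
) -> str:
    """First matching rule wins; anything unmatched is DISALLOWED."""
    if cp in exceptions:
        return exceptions[cp]
    if cp in backward_compatible:
        return backward_compatible[cp]
    for name, value in RULES:
        if categories[name](cp):
            return value
    return "DISALLOWED"


# Toy inputs for illustration only (hypothetical, heavily abbreviated).
toy_exceptions = {0x00DF: "PVALID", 0x00B7: "CONTEXTO", 0x0640: "DISALLOWED"}
toy_categories = {name: (lambda cp: False) for name, _ in RULES}
toy_categories["LDH"] = (
    lambda cp: cp == 0x2D or 0x30 <= cp <= 0x39 or 0x61 <= cp <= 0x7A
)
toy_categories["JoinControl"] = lambda cp: cp in (0x200C, 0x200D)
toy_categories["IgnorableProperties"] = lambda cp: cp in (0x20, 0x200C, 0x200D)

# U+200D matches both JoinControl and IgnorableProperties; the earlier
# rule decides the result.
print(derived_property(0x200D, toy_exceptions, {}, toy_categories))  # CONTEXTJ
```

   Keeping the rule order in a single list makes the precedence explicit
   and keeps the category predicates independent of one another.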


11.  Codepoints

   The Categories and Rules defined in Section 9 and Section 10 apply to
   all Unicode code points.  The table in Appendix B shows, for
   illustrative purposes, the consequences of the categories and
   classification rules, and the resulting property values.

   The list of code points that can be found in Appendix B is non-
   normative.  Section 9 and Section 10 are normative.


12.  IANA Considerations

12.1.  IDNA derived property value registry

   IANA is to create a registry with the derived properties for the
   versions of Unicode that are released after (and including) version
   5.2.  The derived property value is to be calculated in cooperation
   with a designated expert [RFC5226] according to the specifications
   in Section 9 and Section 10 and not by copying the non-normative
   table found in Appendix B.

   If during this process (creation of the table of derived property
   values, followed by a designated expert review) either changes to
   the table of derived properties that are not backward compatible are
   discovered, or other problems arise during the creation of the
   table, that is to be flagged to the IESG.  Changes to the rules (as
   specified in Section 9 and Section 10), including BackwardCompatible
   (Section 9.7) (a set that is empty at the release of this document),
   require IETF Review, as described in [RFC5226].

12.2.  IDNA Context Registry

   For characters that are defined in the IDNA derived property value
   registry (Section 12.1) as CONTEXTO or CONTEXTJ, and therefore
   require a contextual rule, IANA will create and maintain a list of
   approved contextual rules.  Additions or changes to these rules
   require IETF Review, as described in [RFC5226].

   A table from which that registry can be initialized, and some further
   discussion appears in Appendix A.

12.2.1.  Template for context registry

   The following information is to be given when a new rule is created.
      Name: Unique name of the rule
      Code point: Rule should be applied when this codepoint exists in
      a label
      Overview: Description in plain English of what the rule verifies
      Lookup: Should rule be applied at time of lookup?
      Rule Set: The set of rules, as described in


13.  Security Considerations

   TBD


14.  Discussion home for this draft

   This document is discussed on the precis@ietf.org mailing list.
   (This section is to be removed when published as an RFC.)


15.  Acknowledgements

   The author of this document would like to acknowledge the comments
   and contributions of the following people: ...

   Since this document copies a lot of text and the algorithms from the
   IDNAbis tables, all authors and contributors to the idnabis work are
   gratefully acknowledged.


Appendix A.  Contextual Rules Registry

   As discussed in Section 12.2, a registry of rules defines the
   contexts in which particular PROTOCOL-VALID characters, characters
   associated with a requirement for Contextual Information, are
   permitted.  These rules are expressed as tests on the label in which
   the characters appear (all, or any part of, the label may be tested).

   The grammatical rules are expressed in pseudo code.  The conventions
   used for that pseudo code are explained here.

   Each rule is constructed as a Boolean expression that evaluates to
   either True or False.  A simple "True;" or "False;" rule sets the
   default result value for the rule set.  Subsequent conditional rules
   that evaluate to True or False may re-set the result value.

   A special value "Undefined" is used to deal with any error
   conditions, such as an attempt to test a character before the start
   of a label or after the end of a label.  If any term of a rule
   evaluates to Undefined, further evaluation of the rule immediately
   terminates, as the result value of the rule will itself be Undefined.


      cp represents the codepoint to be tested.

      FirstChar is a special term which denotes the first codepoint in a
      string.

      LastChar is a special term which denotes the last codepoint in a
      string.

      .eq. represents the equality relation.

         A .eq. B evaluates to True if A equals B.

      .is. represents checking position in a string.

         A .is. B evaluates to True if A and B have the same position in
         the same string.

      .ne. represents the non-equality relation.

         A .ne. B evaluates to True if A is not equal to B.

      .in. represents the set inclusion relation.

         A .in. B evaluates to True if A is a member of the set B.

   A functional notation, Function_Name(cp), is used to express either
   string positions within a string, Boolean character property tests of
   a codepoint, or a regular expression match.  When such function names
   refer to Boolean character property tests, the function names use the
   exact Unicode character property name for the property in question,
   and "cp" is evaluated as the Unicode value of the codepoint to be
   tested, rather than as its position in the string.  When such
   function names refer to string positions within a string, "cp" is
   evaluated as its position in the string.

   RegExpMatch(X) takes as its parameter X a schematic regular
   expression consisting of a mix of Unicode character property values
   and literal Unicode codepoints.

   Script(cp) returns the value of the Unicode Script property, as
   defined in Scripts.txt in the Unicode Character Database.

   Canonical_Combining_Class(cp) returns the value of the Unicode
   Canonical_Combining_Class property, as defined in UnicodeData.txt in
   the Unicode Character Database.

   Before(cp) returns the codepoint of the character immediately
   preceding cp in logical order in the string.  Before(FirstChar)
   evaluates to Undefined.

   After(cp) returns the codepoint of the character immediately
   following cp in logical order in the string.  After(LastChar)
   evaluates to Undefined.

   Note that "Before" and "After" do not refer to the visual display
   order of the character in a string, which may be reversed or
   otherwise modified by the bidirectional algorithm for strings
   including characters from scripts written right-to-left.  Instead,
   'Before' and 'After' refer to the network order of the character in
   the string.

   The clauses "Then True" and "Then False" imply exit from the pseudo-
   code routine with the corresponding result.

   Repeated evaluation for all characters in a string makes use of the
   special construct:

      For All Characters:
         Expression;
      End For;

   This construct requires repeated evaluation of "Expression" for each
   codepoint in the string, starting from FirstChar and proceeding to
   LastChar.

   The different fields in the rules are to be interpreted as follows:
   Code point:
      The codepoint, or codepoints, that this rule is to be applied to.
      Normally, this implies that if any of the codepoints in a string
      is as defined, then the rules should be applied.  If evaluated to
      True, the codepoint is ok as used; if evaluated to False, it is
      not o.k.


Blanchet                 Expires January 6, 2011               [Page 15]

   Overview:
      A description of the goal of the rule, in plain English.
   Lookup:
      True if application of this rule is recommended at lookup time;
      False otherwise.
   Rule Set:
      The rule set itself, as described above.

Appendix A.1.  ZERO WIDTH NON-JOINER

   Code point:
      U+200C
   Overview:
      This may occur in a formally cursive script (such as Arabic) in a
      context where it breaks a cursive connection as required for
      orthographic rules, as in the Persian language, for example.  It
      also may occur in Indic scripts in a consonant conjunct context
      (immediately following a virama), to control required display of
      such conjuncts.
   Lookup:
      True
   Rule Set:
      False;
      If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
      If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
         (Joining_Type:T)*(Joining_Type:{R,D})) Then True;

Appendix A.2.  ZERO WIDTH JOINER

   Code point:
      U+200D
   Overview:
      This may occur in Indic scripts in a consonant conjunct context
      (immediately following a virama), to control required display of
      such conjuncts.
   Lookup:
      True
   Rule Set:
      False;
      If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
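
   A minimal Python sketch of this rule follows.  It relies on the
   standard unicodedata module, whose data reflects the Unicode version
   bundled with the interpreter rather than Unicode 5.2, and on the
   fact that the Canonical_Combining_Class value named Virama is 9.
   The function name zwj_allowed and its index-based signature are
   assumptions of this sketch.  The same test appears as the first
   clause of the ZERO WIDTH NON-JOINER rule.

```python
import unicodedata

ZWJ = "\u200D"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"


def zwj_allowed(s: str, i: int) -> bool:
    """Return True if the ZERO WIDTH JOINER at position i follows a virama."""
    if s[i] != ZWJ:
        raise ValueError("position i does not hold U+200D")
    if i == 0:
        # Before(FirstChar) is Undefined, so the initial False stands.
        return False
    return unicodedata.combining(s[i - 1]) == VIRAMA_CCC


# DEVANAGARI KA, VIRAMA (U+094D), ZWJ, DEVANAGARI SSA
assert zwj_allowed("\u0915\u094D\u200D\u0937", 2)
assert not zwj_allowed("a\u200Db", 1)
```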

Appendix A.3.  MIDDLE DOT

   Code point:
      U+00B7
   Overview:
      Between 'l' (U+006C) characters only, used to permit the Catalan
      character ela geminada to be expressed.
   Lookup:
      False
   Rule Set:
      False;
      If Before(cp) .eq. U+006C And
         After(cp) .eq. U+006C Then True;
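
   The rule can be sketched in Python as follows.  The function name
   middle_dot_allowed and its index-based signature are assumptions of
   this sketch; a missing neighbour (Undefined) simply fails the
   comparison.

```python
MIDDLE_DOT = "\u00B7"


def middle_dot_allowed(s: str, i: int) -> bool:
    """Return True if the MIDDLE DOT at position i sits between two 'l'."""
    if s[i] != MIDDLE_DOT:
        raise ValueError("position i does not hold U+00B7")
    has_both_neighbours = 0 < i < len(s) - 1
    return has_both_neighbours and s[i - 1] == "l" and s[i + 1] == "l"


# Catalan "col·lecció": the dot at index 3 is between two U+006C.
assert middle_dot_allowed("col\u00B7lecci\u00F3", 3)
assert not middle_dot_allowed("a\u00B7b", 1)
```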

Appendix A.4.  GREEK LOWER NUMERAL SIGN (KERAIA)

   Code point:
      U+0375
   Overview:
      The script of the following character MUST be Greek.
   Lookup:
      False
   Rule Set:
      False;
      If Script(After(cp)) .eq. Greek Then True;

Appendix A.5.  HEBREW PUNCTUATION GERESH

   Code point:
      U+05F3
   Overview:
      The script of the preceding character MUST be Hebrew.
   Lookup:
      False
   Rule Set:
      False;
      If Script(Before(cp)) .eq. Hebrew Then True;

Appendix A.6.  HEBREW PUNCTUATION GERSHAYIM

   Code point:
      U+05F4
   Overview:
      The script of the preceding character MUST be Hebrew.
   Lookup:
      False
   Rule Set:
      False;
      If Script(Before(cp)) .eq. Hebrew Then True;

Appendix A.7.  KATAKANA MIDDLE DOT

   Code point:
      U+30FB
   Overview:
      Note that the Script of Katakana Middle Dot is not any of
      "Hiragana", "Katakana" or "Han".  The effect of this rule is to
      require at least one character in the string to be in one of those
      scripts.
   Lookup:
      False
   Rule Set:
      False;
      For All Characters:
         If Script(cp) .in. {Hiragana, Katakana, Han} Then True;
      End For;
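
   The following Python sketch illustrates the shape of this rule.
   Python's standard library does not expose the Script property, so
   the sketch substitutes a deliberately partial set of codepoint
   ranges (Hiragana letters, Katakana letters, and the main CJK Unified
   Ideographs block).  These ranges are an assumption made for
   illustration and are not equivalent to Scripts.txt; a real
   implementation would consult the Unicode Character Database.

```python
# Partial stand-in for Script(cp) .in. {Hiragana, Katakana, Han}.
# Assumption: these ranges cover only a subset of each script.
APPROX_RANGES = (
    (0x3041, 0x3096),  # Hiragana letters
    (0x30A1, 0x30FA),  # Katakana letters (excludes U+30FB itself)
    (0x4E00, 0x9FCB),  # CJK Unified Ideographs
)


def in_kana_or_han(cp: str) -> bool:
    return any(lo <= ord(cp) <= hi for lo, hi in APPROX_RANGES)


def katakana_middle_dot_allowed(s: str) -> bool:
    """Return True if any codepoint in s is Hiragana, Katakana, or Han."""
    return any(in_kana_or_han(cp) for cp in s)


assert katakana_middle_dot_allowed("\u30A2\u30FB\u30A4")  # KATAKANA A, dot, I
assert not katakana_middle_dot_allowed("a\u30FBb")
```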

Appendix A.8.  ARABIC-INDIC DIGITS

   Code point:
      0660..0669
   Overview:
      Cannot be mixed with Extended Arabic-Indic Digits.
   Lookup:
      False
   Rule Set:
      True;
      For All Characters:
         If cp .in. 06F0..06F9 Then False;
      End For;

Appendix A.9.  EXTENDED ARABIC-INDIC DIGITS

   Code point:
      06F0..06F9
   Overview:
      Cannot be mixed with Arabic-Indic Digits.
   Lookup:
      False
   Rule Set:
      True;
      For All Characters:
         If cp .in. 0660..0669 Then False;
      End For;
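
   Taken together, the ARABIC-INDIC DIGITS and EXTENDED ARABIC-INDIC
   DIGITS rules reject any string that contains digits from both
   ranges.  A minimal Python sketch of that combined effect follows;
   the function name digits_not_mixed is an assumption of this sketch.

```python
def digits_not_mixed(s: str) -> bool:
    """Return False if s mixes U+0660..U+0669 with U+06F0..U+06F9."""
    has_arabic_indic = any(0x0660 <= ord(cp) <= 0x0669 for cp in s)
    has_extended = any(0x06F0 <= ord(cp) <= 0x06F9 for cp in s)
    return not (has_arabic_indic and has_extended)


assert digits_not_mixed("\u0661\u0662")      # Arabic-Indic only
assert digits_not_mixed("\u06F1\u06F2")      # Extended only
assert not digits_not_mixed("\u0661\u06F2")  # mixed: rejected
```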


Appendix B.  Codepoints 0x0000 - 0x10FFFF

   Applying the rules (Section 10) to the code points 0x0000 through
   0x10FFFF of Unicode 5.2 yields the following result.

   This list is non-normative and is included only for illustrative
   purposes.  Specifically, what is displayed in the third column is not
   the formal name of the codepoint (as defined in section 4.8 of
   [Unicode52]).  Differences exist, for example, for codepoints that
   have the codepoint value as part of the name (example: CJK UNIFIED
   IDEOGRAPH-4E00) and in the naming of Hangul syllables.  For many
   codepoints, what is displayed is the official name.

Appendix B.1.  Codepoints in Unicode Character Database (UCD) format

   0000..10FFFF; TBD!


16.  Informative References

   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
              Internationalized Strings ("stringprep")", RFC 3454,
              December 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)",
              RFC 3491, March 2003.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.

   [RFC3722]  Bakke, M., "String Profile for Internet Small Computer
              Systems Interface (iSCSI) Names", RFC 3722, April 2004.

   [RFC3920]  Saint-Andre, P., Ed., "Extensible Messaging and Presence
              Protocol (XMPP): Core", RFC 3920, October 2004.

   [RFC4011]  Waldbusser, S., Saperia, J., and T. Hongal, "Policy Based
              Management MIB", RFC 4011, March 2005.


   [RFC4013]  Zeilenga, K., "SASLprep: Stringprep Profile for User Names
              and Passwords", RFC 4013, February 2005.

   [RFC4505]  Zeilenga, K., "Anonymous Simple Authentication and
              Security Layer (SASL) Mechanism", RFC 4505, June 2006.

   [RFC4518]  Zeilenga, K., "Lightweight Directory Access Protocol
              (LDAP): Internationalized String Preparation", RFC 4518,
              June 2006.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

   [1]  <http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt>

   [2]  <http://unicode.org/Public/UNIDATA/Blocks.txt>


Author's Address

   Marc Blanchet
   Viagenie
   2600 boul. Laurier, suite 625
   Quebec, QC  G1V 4W1
   Canada

   Email: Marc.Blanchet@viagenie.ca
   URI:   http://www.viagenie.ca

