Those Troublesome Characters: A Registry of Unicode Code Points Needing Special Consideration When Used in Network Identifiers
draft-freytag-troublesome-characters-00

Abstract

Unicode's design goal is to be the universal character set for all applications. The goal entails the inclusion of very large numbers of characters. The sheer size of the repertoire increases the possibility of accidental or intentional use of characters that can cause confusion among users, particularly where linguistic context is ambiguous, unavailable, or impossible to determine. A registry of code points that are sometimes especially problematic may be useful to guide system administrators in setting parameters for allowable code points in an identifier system, and to aid applications in creating security aids for users.

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Unicode code points and identifiers

Unicode [CREF1: ajs: reference goes here; references mostly careless in this draft] is a coded character set that aims to support every writing system. Writing systems evolve over time, and are sometimes influenced by one another. As a result, Unicode encodes many characters that, to a reader, appear to be the same thing, but that are encoded differently from one another. This sort of difference is usually not important in written texts, because competent readers and writers of a language are able to compensate for the selection of the "wrong" character when reading or writing.

Identifiers that are used in a network or, especially, an Internet context present three special problems because of the above feature of Unicode:

2. Techniques already in place

In the IDNA mechanism for including Unicode code points [RFC5892], a code point is only included when it meets the needs of internationalizing domain names as explained in the IDNA framework [RFC5894]. For identifiers beyond IDNA, the PRECIS framework [RFC7564] generalizes the same basic technique. In both cases, the overall approach is to assume that all characters are excluded, and then include characters according to properties derived from the Unicode character properties. This general strategy cuts the enormous size of the Unicode database somewhat, avoiding including some characters that are necessarily unsuited for use as identifiers.
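The "exclude by default, include by property" strategy can be illustrated with a minimal sketch. The allow-list of general categories below is an assumption made for illustration only; it is a drastically simplified stand-in for the derived-property rules of [RFC5892] and [RFC7564], and the names ALLOWED_CATEGORIES, is_included, and excluded_code_points are hypothetical.

```python
import unicodedata

# Assumption: a toy allow-list of Unicode general categories. The real
# derived-property rules in RFC 5892 and RFC 7564 are considerably richer.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def is_included(ch: str) -> bool:
    """Default-deny: a code point is allowed only if its category is listed."""
    return unicodedata.category(ch) in ALLOWED_CATEGORIES


def excluded_code_points(identifier: str) -> list[str]:
    """Return, in U+XXXX notation, the code points the toy rule rejects."""
    return [f"U+{ord(ch):04X}" for ch in identifier if not is_included(ch)]
```

Under this toy rule, U+2019 (a punctuation character) is rejected, while U+02BC (a modifier letter) is accepted purely on the strength of its category.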

The mechanism of inclusion by derived property, however, is insufficient to guarantee that every included character is safe for use in identifiers. Some characters' properties lead them to be included even though they are not obviously good candidates. In other cases, individual characters are good for inclusion, but are problematic in combination. Finally, there are cases where two characters or sequences are not problematic by themselves, or if used in alternation in the same identifier, but become problematic when their choice represents the only difference between otherwise identical identifiers. [CREF2: ajs: Do we want examples here?]

Operators of systems that create identifiers (whether through a registry or through a peer-to-peer identifier negotiation system) need to make policies for the characters they will permit. Operators of registries, for instance, can help by adopting good registration policies: "Users will benefit if registries only permit characters from scripts that are well-understood by the registry or its advisers." [RFC5894] The difficulty for many operators, however, is that they do not have the writing system expertise to claim any character is "well-understood", and they do not really have the time to develop that expertise.

To address this difficulty, a registry of Unicode code points that present special issues for network identifiers can guide protocol and operational decisions about whether to permit a given code point or sequence of code points.

In the case of registries, it is not always necessary or desirable to exclude characters so much as to guarantee that they are used in a strictly mutually exclusive way in otherwise identical identifiers.
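One way a registry might enforce mutually exclusive use is sketched below. This is an illustration under stated assumptions, not a method specified by this document: the variant groups are taken from the cross references in the initial registry contents, and the names VARIANT_GROUPS, blocking_key, and ExclusiveRegistry are hypothetical.

```python
# Assumption: code points in the same group are treated as interchangeable
# for blocking purposes. The pairs come from the initial registry contents.
VARIANT_GROUPS = [
    ("\u02BC", "\u2019"),
    ("\u0259", "\u01DD"),
]

# Map every member of a group to the group's first member.
_REPRESENTATIVE = {ch: group[0] for group in VARIANT_GROUPS for ch in group}


def blocking_key(identifier: str) -> str:
    """Replace each code point with its group representative, if it has one."""
    return "".join(_REPRESENTATIVE.get(ch, ch) for ch in identifier)


class ExclusiveRegistry:
    """Accepts an identifier only if no registered one shares its key."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    def register(self, identifier: str) -> bool:
        key = blocking_key(identifier)
        if key in self._by_key:
            return False  # Same identifier, or one differing only by variants.
        self._by_key[key] = identifier
        return True
```

With this sketch, neither code point of a pair is excluded outright; whichever identifier is registered first simply blocks the otherwise identical identifier that uses the other member of the pair.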

3. A registry of code points

3.1. Discussion

The registry contains three fields. The first field, called "Code Point(s)", is a code point or sequence of code points. The second, called "Cross Reference", contains zero or more cross references to related code points. The third, called "Explanation", is a free-form text field that briefly describes the issue. Long paragraphs are discouraged; a code point that needs such discussion should be discussed in a separate document. The explanation field may contain references to documents, so long as the reference is stable.
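The three fields could be modeled along the following lines. This is a sketch only: the document does not define a machine-readable format, so the tab-separated row layout, the comma separator between cross references, the space separator within a sequence, and the names RegistryEntry, parse_code_points, and parse_row are all assumptions.

```python
from dataclasses import dataclass


def parse_code_points(text: str) -> tuple[int, ...]:
    """Parse 'U+XXXX' items; a sequence is assumed to be space-separated."""
    values = []
    for item in text.split():
        if not item.startswith("U+"):
            raise ValueError(f"not in U+XXXX notation: {item!r}")
        values.append(int(item[2:], 16))
    return tuple(values)


@dataclass(frozen=True)
class RegistryEntry:
    code_points: tuple[int, ...]                   # "Code Point(s)"
    cross_references: tuple[tuple[int, ...], ...]  # zero or more
    explanation: str                               # brief free-form text


def parse_row(row: str) -> RegistryEntry:
    """Parse one row, assuming three tab-separated fields."""
    code_points, cross_refs, explanation = row.split("\t")
    return RegistryEntry(
        code_points=parse_code_points(code_points),
        cross_references=tuple(
            parse_code_points(ref) for ref in cross_refs.split(",") if ref.strip()
        ),
        explanation=explanation.strip(),
    )
```

An entry with no related code points simply carries an empty cross-reference tuple, matching the "zero or more" wording above.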

The registry is updated by Expert Review. It ought to contain only code points that are significant in identifiers and that need special policies (including policies of exclusion).

3.2. Registry initial contents

Registry of Unicode Code Points for Special Consideration in Network Identifiers
Code Point(s)	Cross Reference	Explanation
U+02BC	U+2019	Character is indistinguishable from a common punctuation mark
U+0338		Not intended for use in creating letters
U+0259	U+01DD	Phonetic character
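An application could consult the registry to flag code points in a candidate identifier, as in the following sketch. The dictionary reproduces the three initial entries above; the name flag_identifier is hypothetical, and the sketch handles only single code points, not sequences.

```python
# The initial registry contents, keyed by code point.
INITIAL_REGISTRY = {
    0x02BC: "Character is indistinguishable from a common punctuation mark",
    0x0338: "Not intended for use in creating letters",
    0x0259: "Phonetic character",
}


def flag_identifier(identifier: str) -> list[tuple[int, str, str]]:
    """Return (position, code point, explanation) for each registered hit."""
    return [
        (pos, f"U+{ord(ch):04X}", INITIAL_REGISTRY[ord(ch)])
        for pos, ch in enumerate(identifier)
        if ord(ch) in INITIAL_REGISTRY
    ]
```

A non-empty result is a prompt to apply a special policy, which may or may not be exclusion; it is not by itself a reason to reject the identifier.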

4. IANA Considerations

The IANA Services Operator is hereby requested to create the Registry of Unicode Code Points for Special Consideration in Network Identifiers, and to populate it with the values in Section 3.2. The registry is to be updated by Expert Review.

5. Informative References

[RFC5892]	Faltstrom, P., "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)", RFC 5892, DOI 10.17487/RFC5892, August 2010.
[RFC5894]	Klensin, J., "Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010.
[RFC7564]	Saint-Andre, P. and M. Blanchet, "PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols", RFC 7564, DOI 10.17487/RFC7564, May 2015.

Appendix A. Discussion Venue

This Internet-Draft may be discussed on the IAB Internationalization public list: i18n-discuss@iab.org.

Appendix B. Change History

Note to RFC Editor: this section should be removed prior to publication as an RFC.

00:

Initial version

Authors' Addresses

Asmus Freytag ASMUS, Inc. EMail: asmus@unicode.org

John C Klensin 1770 Massachusetts Ave, Ste 322 Cambridge, MA 02140 U.S.A. EMail: john-ietf@jck.com

Andrew Sullivan Dyn, Inc. 150 Dow St Manchester, NH 03101 U.S.A. EMail: asullivan@dyn.com