Internet Engineering Task Force                                 M. Davis
Internet-Draft                                                    Google
Intended status: Informational                               A. Phillips
Expires: March 28, 2012                                           Lab126
                                                               Y. Umaoka
                                                                     IBM
                                                                 C. Falk
                                                       Infinite Automata
                                                      September 25, 2011


                BCP 47 Extension T - Transformed Content
                      draft-davis-t-langtag-ext-06

Abstract

   This document specifies an Extension to BCP 47 which provides subtags
   for specifying the source language or script of transformed content,
   including content that has been transliterated, transcribed, or
   translated, or in some other way influenced by the source. It also
   provides for additional information used for identification.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 28, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  BCP47 Required Information
     2.1.  Overview
     2.2.  Structure
     2.3.  Canonicalization
     2.4.  BCP47 Registration Form
     2.5.  Field Definitions
     2.6.  Registration of Field Subtags
     2.7.  Registration of Additional Fields
     2.8.  Committee Responses to Registration Proposals
     2.9.  Machine-Readable Data
   3.  Acknowledgements
   4.  IANA Considerations
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1. Introduction

   [BCP47] permits the definition and registration of language tag
   extensions "that contain a language component and are compatible
   with applications that understand language tags". This document
   defines an extension for specifying the source of content that has
   been transformed, including text that has been transliterated,
   transcribed, or translated, or in some other way influenced by the
   source. It may be used in queries to request content that has been
   transformed. The "singleton" identifier for this extension is 't'.

   Language tags, as defined by [BCP47], are useful for identifying the
   language of content. There are mechanisms for specifying variant
   subtags for special purposes.
   However, these variants are insufficient for specifying content that
   has undergone transformations, including content that has been
   transliterated, transcribed, or translated. The correct
   interpretation of the content may depend upon knowledge of the
   conventions used for the transformation.

   Suppose that Italian or Russian cities on a map are transcribed for
   Japanese users. Each name needs to be transliterated into katakana
   using rules appropriate for the specific source and target language.
   When tagging such data, it is important to be able to indicate not
   only the resulting content language ("ja" in this case), but also
   the source language.

   Transforms such as transliterations may vary depending not only on
   the source and target script, but also on the source and target
   language. Thus the Russian <U+041F U+0443 U+0442 U+0438 U+043D>
   (which corresponds to the Cyrillic <PE, U, TE, I, EN>) transliterates
   into "Putin" in English but "Poutine" in French. The identifier
   could be used to indicate a desired mechanical transformation in an
   API, or could be used to tag data that has been converted
   (mechanically or by hand) according to a transliteration method.

   In addition, many different conventions have arisen for how to
   transform text, even between the same languages and scripts. For
   example, "Gaddafi" is commonly transliterated from Arabic to English
   as any of (G/Q/K/Kh)a(d/dh/dd/dhdh/th/zz)af(i/y). Some examples of
   standardized conventions used for transcribing or transliterating
   text include:

   a.  United Nations Group of Experts on Geographical Names (UNGEGN)

   b.  US Library of Congress (LOC)

   c.  US Board on Geographic Names (BGN)

   d.  Korean Ministry of Culture, Sports and Tourism (MCST)

   e.  International Organization for Standardization (ISO)

   The usage of this extension is not limited to formal
   transformations, and may include other instances where the content
   is in some other way influenced by the source.
   For example, this extension could be used to designate a request for
   a speech recognizer that is tailored specifically for
   second-language speakers who are first-language speakers of a
   particular language (e.g., a recognizer for "English spoken with a
   Chinese accent").

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2. BCP47 Required Information

2.1. Overview

   Identification of transformed content can be done using the 't'
   extension defined in this document. This extension is formed by the
   't' singleton followed by a sequence of subtags that would form a
   language tag as defined by [BCP47]. This allows for the source
   language or script to be specified to the degree of precision
   required. There are restrictions on the sequence of subtags. They
   MUST form a regular, valid, canonical language tag, and MUST NOT
   include extensions or private use sequences introduced by the
   singleton 'x'. Where only the script is relevant (such as
   identifying a script-script transliteration) then 'und' is used for
   the primary language subtag.

   For example:

   +---------------------+---------------------------------------------+
   | Language Tag        | Description                                 |
   +---------------------+---------------------------------------------+
   | ja-t-it             | The content is Japanese, transformed from   |
   |                     | Italian.                                    |
   | ja-Kana-t-it        | The content is Japanese Katakana,           |
   |                     | transformed from Italian.                   |
   | und-Latn-t-und-cyrl | The content is in the Latin script,         |
   |                     | transformed from the Cyrillic script.       |
   +---------------------+---------------------------------------------+

   Note that the sequence of subtags governed by 't' cannot contain a
   singleton (a single-character subtag), because that would start a
   new extension.
   For example, the tag "ja-t-i-ami" does not indicate that the source
   is in "i-ami", because "i-ami" is not a regular language tag in
   [BCP47]. That tag would express an empty 't' extension followed by
   an 'i' extension.

   The 't' extension is not intended for use in structured data that
   already provides separate source and target language identifiers.
   For example, this is the case in localization interchange formats
   such as XLIFF. In such cases, it would be inappropriate to use
   "ja-t-it" for the target language tag because the source language
   tag "it" would already be present in the data. Instead one would use
   the language tag "ja".

   As noted earlier, it is sometimes necessary to indicate additional
   information about a transformation. This additional information is
   optionally supplied after the source in a series of one or more
   fields, where each field consists of a field separator subtag
   followed by one or more non-separator subtags. Each field separator
   subtag consists of a single letter followed by a single digit.

   A transformation mechanism is an optional field that indicates the
   specification used for the transformation, such as "UNGEGN" for the
   United Nations Group of Experts on Geographical Names
   transliterations and transcriptions. It uses the 'm0' field
   separator followed by certain subtags.

   For example:

   +------------------------------------+------------------------------+
   | Language Tag                       | Description                  |
   +------------------------------------+------------------------------+
   | und-Cyrl-t-und-latn-m0-ungegn-2007 | The content is in Cyrillic,  |
   |                                    | transformed from Latin,      |
   |                                    | according to a UNGEGN        |
   |                                    | specification dated 2007.    |
   +------------------------------------+------------------------------+

   The field separator subtags such as 'm0' were chosen because they
   are short, visually distinctive, and cannot occur as a subtag of a
   language tag (except within an extension or after 'x'), thus
   eliminating the potential for collision or confusion with the source
   language tag.
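
   The field structure described above can be illustrated with a short
   Python sketch. It is not part of this specification: the function
   name parse_t_extension and the shape of its return value are
   hypothetical, chosen only for illustration. The sketch splits a tag
   at the 't' singleton, collects the source subtags, and groups the
   remaining subtags under their field separators. It assumes a
   syntactically well-formed tag and performs no validation against
   [BCP47] or [UTS35].

```python
import re

# A field separator is one letter followed by one digit, e.g. "m0".
FIELD_SEP = re.compile(r"^[a-z][0-9]$")


def parse_t_extension(tag):
    """Split a tag into (target, source, fields) around its 't' extension.

    Hypothetical helper for illustration only. Subtags are lowercased
    (tags are not case sensitive) and are not checked against any
    registry. Returns source=None when the tag has no 't' extension.
    """
    subtags = tag.lower().split("-")

    # Locate the 't' singleton; stop searching at a private use 'x'.
    t_index = None
    for i, subtag in enumerate(subtags):
        if subtag == "x":
            break
        if subtag == "t" and i > 0:
            t_index = i
            break
    if t_index is None:
        return "-".join(subtags), None, {}

    source = []
    fields = {}
    current = None  # field separator whose subtags are being collected
    for subtag in subtags[t_index + 1:]:
        if len(subtag) == 1:
            break  # another singleton starts a new extension
        if FIELD_SEP.match(subtag):
            current = subtag
            fields[current] = []
        elif current is None:
            source.append(subtag)  # still inside the source language tag
        else:
            fields[current].append(subtag)

    target = "-".join(subtags[:t_index])
    return target, "-".join(source), fields
```

   For "und-Cyrl-t-und-latn-m0-ungegn-2007" this sketch yields the
   target "und-cyrl", the source "und-latn", and a single 'm0' field
   holding "ungegn" and "2007". For "ja-t-i-ami" it yields an empty
   source and no fields, since the 'i' singleton ends the 't'
   extension immediately.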

   The field subtags are defined by Section 3 [1] of Unicode Technical
   Standard #35: Unicode Locale Data Markup Language [UTS35]. As
   required by BCP 47, subtags follow the language tag ABNF and other
   rules for the formation of language tags and subtags, are restricted
   to the ASCII letters and digits, are not case sensitive, and do not
   exceed eight characters in length.

   EDITORIAL NOTE: This new facility has been accepted by the Unicode
   CLDR committee for incorporation into the next version of Unicode
   CLDR, parallel with the structure of the 'u' extension [RFC6067],
   for which it is already the maintaining authority. The data and
   specification will be available by the time this Internet-Draft has
   been approved.

   LDML is available over the Internet at no cost, via a royalty-free
   license at http://unicode.org/copyright.html. LDML is versioned, and
   each version of LDML is numbered, dated, and stable. Extension
   subtags, once defined by LDML, are never retracted or substantially
   changed in meaning.

   The maintaining authority for the 't' extension is the Unicode
   Consortium:

   +---------------+---------------------------------------------------+
   | Item          | Value                                             |
   +---------------+---------------------------------------------------+
   | Name          | Unicode Consortium                                |
   | Contact Email | cldr-contact@unicode.org                          |
   | Discussion    | cldr-users@unicode.org                            |
   | List Email    |                                                   |
   | URL Location  | cldr.unicode.org                                  |
   | Specification | Unicode Technical Standard #35 Unicode Locale     |
   |               | Data Markup Language (LDML),                      |
   |               | http://unicode.org/reports/tr35/                  |
   | Section       | Section 3 Unicode Language and Locale Identifiers |
   +---------------+---------------------------------------------------+

2.2. Structure

   The subtags in the 't' extension are of the following form:

   t-ext    = "t"                      ; Extension
              (("-" lang *("-" field)) ; Source + optional field(s)
              / 1*("-" field))         ; Field(s) only (no source)

   lang     = language                 ; BCP47, with restrictions
              ["-" script]
              ["-" region]
              *("-" variant)

   field    = sep 1*("-" 3*8alphanum)  ; With restrictions

   sep      = ALPHA DIGIT              ; Subtag separators

   alphanum = ALPHA / DIGIT

   where ,