Internet Draft                                                 M. Duerst
                                                    University of Zurich
Expires January 1998                                           July 1997


          Handling Internationalized Query Components in URLs


Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months. Internet-Drafts may be updated, replaced, or obsoleted by
   other documents at any time. It is not appropriate to use
   Internet-Drafts as reference material or to cite them other than as
   a "working draft" or "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim).

   Distribution of this document is unlimited. Please send comments to
   the author (see Author's Address) or to the uri mailing list at
   uri@bunyip.com.

   This document is currently a pre-draft, for restricted discussion
   only. It is intended to become part of a suite of documents related
   to the internationalization of URLs.


Abstract

   HTTP and HTML provide the facility to query the user and return the
   results. This is usually done in the query component of an URL.
   This mechanism works satisfactorily for characters of the US-ASCII
   repertoire. Due to the lack of an agreed encoding for other
   characters, the situation is much less satisfactory for characters
   outside the US-ASCII repertoire.

   This document makes two contributions to the problem: (1) It
   describes an application convention that is already widely
   followed and is sufficient in many cases. (2) It introduces an
   addition to HTTP to ease the transition to a general
   internationalized URL architecture.
Expires End of January 1998                                     [Page 1]

Internet Draft     Internationalized Query Components          July 1997


Table of contents

   1. Introduction
      1.1 General
      1.2 Terms
   2. A Simple Application Convention for Browsers
   3. Upgrading of Query Component to UTF-8
      3.1 The Query-UTF-8 Request/Response-Header Field
      3.2 Rationale
   Bibliography
   Author's Address


1. Introduction

1.1 General

   HTTP (HyperText Transfer Protocol [HTTP1.1]) and HTML (HyperText
   Markup Language [HTML4.0]) provide the facility to query the user
   (with a FORM in HTML) and return the results to the server. There
   are various ways to return the result (see in particular
   [Fileupload]), but the one most widely used is to encode the result
   in the query component of an URL [RFC1738, URLsyntax]. This
   mechanism works satisfactorily for characters of the US-ASCII
   repertoire. Due to the lack of an agreed encoding for other
   characters, the situation is much less satisfactory for characters
   outside the US-ASCII repertoire.

   Ideally, the problem would be solved by agreeing on a single
   character encoding for all query components or all URLs. The
   outstanding candidate for this is UTF-8 [RFC2044]. UTF-8 is already
   the preferred encoding for new URL schemes [URLprocess], the only
   encoding for a recently defined URL scheme [IMAPURL], the encoding
   on the wire for beyond-ASCII FTP filenames [FTPINT] (thus making it
   the encoding for the ftp: URL scheme) and the encoding suggested
   for the Internet in general [RFC2130].
   UTF-8 has various important properties, in particular that it is
   completely compatible with US-ASCII and is easily detectable by
   simple heuristics.

   Moving to UTF-8 for URLs is most difficult for the query component.
   For the other components, in particular the path component, the
   namespace is very sparse and well known to the server, whereas for
   the query component it is dense and not well known. To increase the
   reliability of transmitting query information, this document
   describes an existing convention and proposes a new protocol
   element for HTTP.

1.2 Terms

   This section contains definitions and explanations for some terms
   that may otherwise not be clear.

   - Accept-Charset attribute: An HTML attribute, proposed in
     [RFC2070] and taken up in HTML 4.0 [HTML4.0]. Please note that
     this is not the same as the Accept-Charset request-header field
     in HTTP. Please also note that the Accept-Charset attribute is on
     INPUT and TEXTAREA in RFC 2070, but on FORM in HTML 4.0. The
     HTML 4.0 syntax is preferred, and assumed in this document.

   - CGI Script: In the context of this document, a placeholder for
     any kind of functional component used to process a response to a
     query.

   - Character Encoding: A mapping from an octet sequence to a
     sequence of characters. Misleadingly called "character set" in
     some IETF documents [RFC2045]. Denoted by the value of the
     "charset" parameter, with values from the corresponding IANA
     registry [IANA].

   - Transcoding Server/Proxy: An HTTP server or proxy which
     transcodes the documents it serves, to respond to an
     "Accept-Charset" HTTP request header field.

   - Transcoding: The act of changing the character encoding of a
     document, while not changing it otherwise (the length of the
     document may be affected).


2. A Simple Application Convention for Browsers

   This section spells out an application convention that most
   current and older browsers already follow, though not all of them
   follow it completely, and that is easy to implement.

   The convention is that a user agent should send back the results of
   a query in exactly the same character encoding as the character
   encoding of the document that contained the FORM, as received by
   the user agent.

   The advantage of this application convention is that it works
   nicely for documents and CGI scripts that assume a single
   character encoding. In the simple case, neither the server nor the
   CGI script has to do any special processing such as trying to
   detect the character encoding of the query component or transcoding
   the query component.

   This application convention fails if the document has been
   transcoded by a transcoding proxy. The query component is then sent
   back in the character encoding requested by the user agent, which
   is the target character encoding of the transcoding undergone at
   the proxy, not the encoding the server used. The proxy, however,
   must not change the query component on its way back to the server
   (see [HTTP1.1]), so the server receives it in an encoding it does
   not expect.


3. Upgrading of Query Component to UTF-8

   For those parts of an URL that originate at the server, in
   particular for the path component, the introduction of UTF-8
   [RFC2044] as the encoding of choice can be made on a per-server or
   per-resource basis. Because the namespace of the path component is
   usually very sparsely populated, it is even possible to accept URLs
   with path components in different character encodings for the same
   resource.

   The query component of an URL, however, is in most cases generated
   independently in the user agent, and the namespace can be very
   densely populated. To upgrade it to UTF-8 therefore requires
   additional provisions.

   Here, we propose to add a single header field to HTTP.
   The header field is used both as a request header field and as a
   response header field.

3.1 The Query-UTF-8 Request/Response-Header Field

   The syntax of the Query-UTF-8 request/response-header field is
   defined as follows:

      query-utf-8 = "Query-UTF-8" ":" ( "Yes" | "No" )

   Both "Yes" and "No" above are case insensitive. That is, "Yes" as
   well as "yes" or "yES", and so on, are acceptable.

   As a response-header field (sent from the server to the client),
   the field indicates whether the user agent can send back the query
   component encoded as UTF-8 or not. If the value is "Yes", and the
   scheme component and site component of the URL of the document
   containing the FORM and the URL given for query submission are
   identical, the query component SHOULD be sent back encoded as
   UTF-8. If the value is "No", and the FORM does not have an
   Accept-Charset attribute that contains the "charset" parameter
   value "UTF-8", then the query component MUST be sent back according
   to the application convention described in Section 2, or in some
   other way by older browsers.

   As a request-header field (sent from the client to the server; the
   term request-header field is somewhat misleading here), the field
   indicates whether the query component is encoded as UTF-8. A
   Query-UTF-8 request-header field MUST be sent when the following
   conditions are all met:

   - The URL sent back contains a query component.

   - The document containing the FORM is received with a Query-UTF-8
     response-header field with value "Yes", or the Accept-Charset
     attribute of the FORM contains the charset parameter value
     "UTF-8".

   - The client recognizes the corresponding syntax.

   (The intention of the last condition is to be able to phase out
   Query-UTF-8 after a transitory period.)

3.2 Rationale

   The availability of both the Accept-Charset attribute on FORM and
   the Query-UTF-8 response-header field may seem unnecessary.
   The rationale is to allow two modes of operation, called
   server-driven and script-driven.

   In script-driven mode, the CGI script handles character encoding
   negotiation and identification. Typically, the author of a FORM
   document and the corresponding CGI script will use the
   Accept-Charset attribute on FORM with the value "UTF-8" to tell the
   client to send back data in UTF-8. The script will then check for
   the presence and value of the Query-UTF-8 request-header field in
   the response from the client, and make conversions if necessary.

   In server-driven mode, the character encoding that a CGI script
   expects to receive is registered with the server in a similar way
   as the character encodings of documents (including those generated
   by CGI scripts) are registered. A server offering such a
   functionality adds the Query-UTF-8 response-header field with value
   "Yes" to outgoing documents containing FORMs, and converts from
   UTF-8 back to the encoding the CGI script is expecting when a query
   arrives with "Query-UTF-8: Yes".

   The distinction between script-driven and server-driven mode is not
   made based on whether Query-UTF-8 or the Accept-Charset attribute
   is used. Both features are provided because it is easier for a
   document author to use Accept-Charset, and easier for a server to
   add Query-UTF-8. Also, because a server does not know about the
   facilities available on other servers, "Query-UTF-8: Yes" sent from
   the server to the client is only valid if the query result is sent
   back to the same server. For query results sent to other servers,
   the Accept-Charset attribute must be used.


Acknowledgements

   I am grateful in particular to the following persons for their help
   and/or criticism: Roy Fielding, Eric van der Poel, Francois
   Yergeau, Gavin Nicol, Frank Tang, Larry Masinter, and Tim
   Greenwood.


Bibliography

   [Fileupload]   E. Nebel and L. Masinter, "Form-based File Upload in
                  HTML", draft-ietf-html-fileupload-03.txt, August
                  1995.

   [FTPINT]       B. Curtin, "Internationalization of the File
                  Transfer Protocol",
                  draft-ietf-ftpext-intl-ftp-02.txt, June 1997.

   [HTTP1.1]      R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and
                  T. Berners-Lee, "Hypertext Transfer Protocol --
                  HTTP/1.1", RFC 2068, January 1997.

   [HTML4.0]      D. Raggett, A. Le Hors, and I. Jacobs, "HTML 4.0
                  Specification", http://www.w3.org/TR/WD-html40/,
                  July 1997.

   [IMAPURL]      Ch. Newman, "IMAP URL Scheme",
                  draft-newman-url-imap-10.txt, July 1997.

   [RFC1738]      T. Berners-Lee, L. Masinter, and M. McCahill,
                  "Uniform Resource Locators (URL)", RFC 1738, CERN,
                  Dec. 1994.

   [RFC2044]      F. Yergeau, "UTF-8, A Transformation Format of
                  Unicode and ISO 10646", RFC 2044, Alis Technologies,
                  October 1996.

   [RFC2045]      N. Freed and N. Borenstein, "Multipurpose Internet
                  Mail Extensions (MIME) Part One: Format of Internet
                  Message Bodies", RFC 2045, November 1996.

   [RFC2070]      F. Yergeau, G. Nicol, G. Adams, and M. Duerst,
                  "Internationalization of the Hypertext Markup
                  Language", RFC 2070, January 1997 (Note: This RFC is
                  currently being updated to reference Unicode 2.0 and
                  ISO 10646 including AM-5. The new definition of
                  UTF-8 should be used).

   [RFC2130]      C. Weider, C. Preston, K. Simonsen, H. Alvestrand,
                  R. Atkinson, M. Crispin, and P. Svanberg, "The
                  Report of the IAB Character Set Workshop held 29
                  February - 1 March, 1996", RFC 2130, April 1997.

   [URLprocess]   L. Masinter, D. Zigmond, and H. Alvestrand,
                  "Guidelines and Process for new URL Schemes",
                  draft-masinter-url-process-01.txt, March 1997.

   [URLsyntax]    T. Berners-Lee, R. Fielding, and L. Masinter,
                  "Uniform Resource Locators (URL): Generic Syntax and
                  Semantics", draft-fielding-url-syntax-05.txt, May
                  1997.


Author's Address

   Martin J. Duerst
   Multimedia-Laboratory
   Department of Computer Science
   University of Zurich
   Winterthurerstrasse 190
   CH-8057 Zurich
   Switzerland

   Tel: +41 1 257 43 16
   Fax: +41 1 363 00 35
   E-mail: mduerst@ifi.unizh.ch

   NOTE -- Please write the author's name with u-Umlaut wherever
   possible, e.g. in HTML as D&uuml;rst.
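
   The client-side rules for the Query-UTF-8 field and the
   Accept-Charset attribute on FORM can be read as a small decision
   procedure. The Python sketch below is one possible reading, not
   part of the proposal. All function and parameter names are
   hypothetical. It assumes that the "site component" of an URL is its
   host[:port] part, that a client picks UTF-8 whenever the FORM lists
   it, that the request-header field is sent only when the query is
   actually encoded as UTF-8, and it uses plain percent-encoding for
   the query.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote, urlsplit


def parse_query_utf8(value: Optional[str]) -> Optional[bool]:
    """Map a Query-UTF-8 field value to True/False (case-insensitive).

    Returns None when the field is absent or has an unknown value.
    """
    if value is None:
        return None
    return {"yes": True, "no": False}.get(value.strip().lower())


def same_scheme_and_site(doc_url: str, action_url: str) -> bool:
    """Compare scheme and site; 'site' is assumed to mean host[:port]."""
    doc, action = urlsplit(doc_url), urlsplit(action_url)
    return (doc.scheme, doc.netloc.lower()) == (
        action.scheme,
        action.netloc.lower(),
    )


def choose_query_encoding(
    doc_url: str,
    action_url: str,
    doc_charset: str,
    response_field: Optional[str],
    form_accept_charset: Optional[str],
) -> Tuple[str, bool]:
    """Return (encoding for the query, whether to send "Query-UTF-8: Yes")."""
    listed = (form_accept_charset or "").replace(",", " ").lower().split()
    form_wants_utf8 = "utf-8" in listed
    # "Yes" from the server only counts for submissions to the same site.
    server_wants_utf8 = parse_query_utf8(
        response_field
    ) is True and same_scheme_and_site(doc_url, action_url)
    if form_wants_utf8 or server_wants_utf8:
        return "utf-8", True
    # Otherwise fall back to the convention: the document's own encoding.
    return doc_charset, False


def encode_query(fields: Dict[str, str], encoding: str) -> str:
    """Percent-encode name/value pairs in the chosen character encoding."""

    def esc(text: str) -> str:
        return quote(text.encode(encoding), safe="")

    return "&".join(f"{esc(k)}={esc(v)}" for k, v in fields.items())


if __name__ == "__main__":
    enc, send_field = choose_query_encoding(
        "http://a.example/form.html",
        "http://a.example/cgi-bin/search",
        "iso-8859-1",
        "Yes",
        None,
    )
    print(enc, send_field, encode_query({"name": "D\u00fcrst"}, enc))
```

   With the same document and "Query-UTF-8: Yes", but a submission URL
   on a different site and no Accept-Charset attribute, the sketch
   falls back to the document's own encoding and sends no
   request-header field.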