Internet-Draft                                    M.T. Carrasco Benitez
                                                                   EMEA
Expires 29 February 1999                               1 September 1999

                               Xdossier

Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This is an informational memo for Xdossier. A Xdossier is a data
object designed for browsing with web browsers and mappable to XML.
It is based on a directory structure containing files in several
formats.

Table of Contents

1. Introduction
2. Rationale
3. Terminology
4. Name
5. Representation
6. File extension
7. Format
8. Character encoding
9. Web Formats
10. Directory Index
11. Root directory
12. Well-formed and valid Xdossier
13. Xdossier DTD
   13.1. By-Example Xdossier DTD
   13.2. Syntactic Xdossier DTD
14. Self-containness
15. Compound Xdossier
16. Mapping
17. HTML for Index
18. References
19. Author
   19.1. Disclaimer

1. Introduction

It is recommended to experiment with a Xdossier example first, as
this should make the memo easier to understand. For examples, see
http://xdossier.com.

This recommendation is about organising files. They are organised
into a data object called Xdossier. Informally, a Xdossier is a
directory structure with files in several formats created for web
browsing, either direct ("file:") or served ("http:").

Classifying files within directories is easy and very instinctive.
A few HTML files with some descriptions and links can greatly help
the browsing and give a feel of "oneness". One can easily start
organising from the directory structure point of view. By following a
few rules, one can end up with a data object that is easy to browse
and has a significant structure.

A directory structure is a tree similar to an XML document. There is
a strong parallelism:

   directory structure         XML
   -------------------         ---
   root directory              document element/document entity
   directory                   element
   file                        entity
   directory name              element name
   file name                   entity reference
   content of XML file         parsed entity
   content of non-XML file     unparsed entity

With a formal mapping to XML, the directory structure could be
transformed into an XML document. A strategy could be to start with
the (main) "tree" and to progress with the organisation towards the
content of the individual files (the "leaves"): at first a few files
could be XML files; eventually the whole Xdossier should be
transformable into an XML document. This approach is particularly
useful for organising large amounts of legacy data in several formats
for which there is no clear formal definition.

2. Rationale

- Usable with web browsers.

- Easy to "produce" and easy to "consume".

- Usable "as is" and adapted to further processing. For example, a
  CD-ROM must be usable directly ("raw" consumption) and programs
  should be capable of mechanical processing to load it into a DBMS,
  web server, etc.

- Easy to prepare with resources (computer equipment, programs,
  staff, etc.) available in most firms or acquirable at low cost. In
  particular, it should be easy to prepare by hand without the need
  for special programs.

- Mappable to XML.

- Vendor independent.

- Usable as an interface to exchange data.

3. Terminology

Terms specific to this memo usually have the first character of each
token capitalised.

- By-Example Xdossier DTD: A type of Xdossier DTD.

- By-Example DTD: Abbreviation of "By-Example Xdossier DTD".
- Directory Index: File, usually named "index.html", that contains
  links to and information on the files in a particular directory.

- Xdossier: (1) The concept as described in this memo.
  (2) Abbreviation of "Xdossier Instance".

- Xdossier Instance: Parallel meaning with XML document instance.

- Xdossier Skeleton: A type of Xdossier DTD.

- Xdossier Table of Contents: The Index in the root directory. The
  Xdossier Table of Contents must allow the navigation of the whole
  Xdossier. Typically, there would be links to other Directory
  Indexes.

- Index: Abbreviation of "Directory Index".

- Instance: Abbreviation of "Xdossier Instance".

- Skeleton: Abbreviation of "Xdossier Skeleton".

- Table of Contents: Abbreviation of "Xdossier Table of Contents".

4. Name

A Xdossier that does not conform to this section is "Non-Naming
Conformant", though every Xdossier must conform at least to the
naming in XML [XML].

"Name" is a token composed of the following characters:

- Letters "a" to "z"; i.e., lower case only [U+0061 to U+007A].

- Digits "0" to "9" [U+0030 to U+0039].

- "-" [HYPHEN-MINUS, U+002D].

- "_" [LOW LINE, U+005F].

The notation "U+" refers to the Unicode [Unicode] notation.

Correct Names

   part_a
   part-b
   myfile
   hello

Incorrect Names

   part a        (' ' ; SPACE is not allowed)
   Myfile        (capitals are not allowed)
   myfile.xml    ('.' ; FULL STOP is not allowed)
   hello:html    (':' ; COLON is not allowed)

"Directory Name" is a Name.

"File Name" is one Name optionally followed by further Name(s), with
consecutive Names separated by a '.' (FULL STOP, U+002E).

Correct File Names

   a_part
   myfile.html
   hello.en.xml
   hello.en.xml.gz

Incorrect File Names

   a part        (' ' ; SPACE is not allowed)
   Myfile.html   (capitals are not allowed)
   hello:xml     (':' ; COLON is not allowed)

"Document Name" is the first Name in the File Name. For example,
"docname" in the File Name "docname.ext".

"File extension(s)" is/are the second and following Name(s). For
example, "ext1", "ext2" and "ext3" in the File Name
"docname.ext1.ext2.ext3".

5.
Representation

The same information can be represented in different ways. The
dimensions considered are:

- Language; e.g., English, Spanish.

- Media type; e.g., HTML, PDF.

- Encoding; e.g., gzip, compress.

6. File extension

File extensions are used to indicate representations. For example:

   hello                no extension
   hello.html           format HTML
   hello.en             language English
   hello.gz             compressed using "gzip"
   hello.en.html        English in HTML
   hello.html.gz        HTML, gzipped
   hello.en.gz          English, gzipped
   hello.en.html.gz     English, HTML, gzipped

File extensions, particularly the last one, are operating-system
dependent:

- Syntax: e.g., DOS allows file extensions of up to three characters.

- Association: which program is associated with the extension.

The extension should correspond to the widely used mappings between
Internet Media Types [IMT] and file extensions. The examples above
work for transparent content negotiation in Apache.

Note the difference between "file" and "document". File refers to
physical storage; e.g., "mydoc.txt" is a file. Document refers to
content; e.g., "mydoc" is a document represented in the files
"mydoc.txt" and "mydoc.html", which contain the same document in
different formats.

Another memo should address the syntax for file extensions.

7. Format

Priority should be given to file formats (media types) with a good
chance of being readable "forever"; e.g., in 50 years. This points to
"neutral" formats: formal standard, industrial standard, vendor
independent, "text-like", etc.

One should not discard proprietary formats, as they could be the
"source" format; i.e., the format in which the data was originally
produced. Often, information is lost in format transformation.

The recommendation is to include:

- The source format.

- At least one neutral format.

- An indication of the method used in the format transformations;
  e.g., source format saved as HTML using the "Save as" facility of
  a given application.

The file formats in order of preference are:

- Text: XHTML, HTML, XML, text, RTF and PDF.
- Graphic: JPEG, GIF and TIFF.

Other formats could also be included. They should be widely used
formats.

[Relation to Xdossier DTD]: It could include a list of accepted
formats in order of preference and different mappings between the
Internet Media Types and file extensions.

8. Character encoding

The character encodings ("charsets") in order of preference are:

- Unicode UTF-8, Unicode 16 bits [ISO10646].

- ISO-8859-1 (Latin-1) or the appropriate ISO-8859-x; e.g.,
  ISO-8859-7 for Greek.

Other character encodings could also be used; they should be widely
used.

[Relation to Xdossier DTD]: It could include a list of accepted
character encodings in order of preference.

9. Web Formats

Web Formats are file formats well adapted to the web and widely
supported in browsers. Corollary: a format that is very good for the
web but not widely supported in browsers is not a Web Format.

Web Format is a fuzzy, moving definition. It is also "community
dependent"; e.g., one community could consider XML a Web Format and
another community could consider that it is not.

By default, the only Web Format is HTML.

[Relation to Xdossier DTD]: It could redefine the list of Web
Formats.

10. Directory Index

Directory Index, abbreviated to Index, is a document in Web Format
included in each directory.

There is a close association among the directory, the Index in the
directory and the file(s). They refer to each other as "his". For
example, "Index and his directory and files".

Index should fulfil a dual function:

- Browsing (informal view).

- Metadata (formal view).

For browsing, Index should have a description of his directory and
meaningful labels with links to at least his file(s). It could also
contain links to other files/resources. Links to files within a
Xdossier must be relative. Whenever possible, links within a Xdossier
should point to an Index rather than a file.

For Metadata, Index should contain the metadata of his directory and
files.
The metadata should be machine processable.

Browsing and metadata are functions. Syntactically, they could be
interwoven.

Syntactically, there are two types of Indexes:

- Informal Index: It does not follow any particular syntax.

- Formal Index: It follows a syntax.

If Index is not present, the File Names in the directory should be
meaningful.

The default Document Name for Index is "index" and the default format
is the default Web Format. Hence, at present the default File Name
for Index is "index.html".

Another memo should address the syntax for Formal Index.

[Relation to Xdossier DTD]: It could redefine the default Index name.

11. Root directory

The root directory must contain only one file, the Index, and zero or
more directories. Corollary: the trivial Xdossier is composed of one
Index.

The intention in allowing only one file in the root directory is to
make it obvious that the file present is the Table of Contents, as
the other elements must be directories.

12. Well-formed and valid Xdossier

A Xdossier is well-formed when it follows these recommendations. In
particular, it does not need a Xdossier DTD.

A Xdossier is valid when it is well-formed and in addition follows
the restrictions in a Xdossier DTD (By-Example DTD or Syntactic DTD).

13. Xdossier DTD

A Xdossier DTD is needed for valid Xdossiers. There are two types of
Xdossier DTDs:

- "By-Example".

- "Syntactic".

13.1. By-Example Xdossier DTD

A Xdossier Instance could be a Xdossier DTD just by declaring that it
is a Xdossier DTD; i.e., follow "this example". Probably, some
aspects would be fuzzy.

More realistically, a By-Example DTD should be a "Xdossier Skeleton";
i.e., a purpose-built example. Typically, the presence of a file in
the Skeleton means that it must be present in the instantiations with
the same name and format. Additional instructions should be in the
Indexes; e.g., "such a file is optional".

People with limited knowledge of computers could create By-Example
DTDs, as it is instinctive.
Probably, the path would be to create a well-formed Instance and then
to proceed with the creation of a Skeleton.

As the approach does not have a fixed syntax, it is not intended for
full mechanical validation by computer. Some parts would have to be
validated by humans, though parts that follow a syntax could be
validated mechanically. For example, the content model of the
files/directories could be defined as:

- DTD: an XML DTD.

- Pairs of values: for example, a list of pairs of values like
  "/food/choco/index.html=Documents about chocolate"

13.2. Syntactic Xdossier DTD

This is needed to implement computer programs that could do full
mechanical validation of Xdossiers.

Another memo should address the syntax for Syntactic Xdossier DTD.

14. Self-containness

There are three levels:

- Absolute Xdossier: When all the resources are in the Xdossier.

- Self-Contain Xdossier: When all "Essential Resources" are in the
  Xdossier. For example, the CSS is in the Xdossier, though there
  could be secondary references to other resources such as a
  reference to the W3C site at http://w3.org. At least this level
  should be attained.

- Fragment Xdossier: When at least one "Essential Resource" is not in
  the Xdossier. For example, the CSS is not in the Xdossier and it
  relies on an external CSS such as the one in the W3C site at
  http://www.w3.org/StyleSheets/Core/. It is only recommended as a
  directory of Xdossier. Otherwise, there should be an agreement
  between producers and consumers of the Xdossier.

Essential Resources are the ones needed for navigation and display.

[Relation to Xdossier DTD]: It could include the minimal level of
Self-containness requested and a re-definition of the Essential
Resources.

15. Compound Xdossier

A Compound Xdossier is a Xdossier where all the directories in the
root directory are themselves Xdossiers. These directories could also
be Compound Xdossiers, and so on.

[Relation to Xdossier DTD]: It could require Compound Xdossier.

16.
Mapping

Mapping is for transforming between Xdossier and XML. Xdossier should
be transformable into XML.

Mapping:

   directory         <->   element
   directory name    <->   element name
   root directory    <->   document element
   Index             <->   attributes (for his directory and files)
   file              <->   entity
   file name         <->   entity name
   file content      <->   entity reference
   XML file          <->   parsed entity
   Non-XML file      <->   unparsed entity

17. HTML for Index

The HTML Indexes should follow these indications:

- Simple mainstream HTML; i.e., facilities that are easy to write and
  that work in most browsers.

- XHTML [XHTML] approach. For example, well-formed documents, content
  separated from the presentation (e.g., CSS [CSS2]), etc.

- It is recommended to use one CSS for all the Indexes.

- A reasonable presentation with the most popular browsers (e.g.,
  Internet Explorer, Navigator, etc.) and text-only browsers (e.g.,
  Lynx).

- Links that work when read directly (e.g., a CD-ROM inserted into a
  PC) or served by an HTTP server; i.e., "file:" or "http:".

- Links that point directly to files, except when the intention is to
  show the content of the directory. One should not assume that a
  Xdossier would be served by a server; i.e., it should work directly
  ("file:") or served ("http:").

- No frames, scripts (e.g., JavaScript) or Java applets.

- Images (IMG) with alternative texts.

- Relative links within the Xdossier; e.g., href="../doc.html".

- Language attributes (lang, xml:lang, etc.) to indicate the language
  of the text.

[Relation to Xdossier DTD]: It could change the HTML indications. In
particular, it could include a CSS for the Indexes.

18. References

[ALLEN]    Package or Perish. Terry Allen. Pages 385-390 in SGML/XML
           '97 Conference Proceedings. SGML/XML '97.
[CML]      Chemical Markup Language
           http://www.venus.co.uk/omf/

[CSS2]     Cascading Style Sheets, level 2
           http://www.w3.org/TR/REC-CSS2

[DC]       Dublin Core
           http://purl.org/dc

[ISO10646] Information Technology -- Universal Multiple-Octet Coded
           Character Set (UCS) -- Part 1: Architecture and Basic
           Multilingual Plane, ISO/IEC 10646-1:1993

[HTML]     HTML 4.0 Specification
           http://www.w3.org/TR/REC-html40

[IMT]      Internet Media Types
           http://www.isi.edu/in-notes/iana/assignments/media-types/media-types

[MHTML]    The MIME Multipart/Related Content-type. E. Levinson
           ftp://ftp.ietf.org/rfc/rfc2387.txt

[RDF]      Resource Description Framework Model and Syntax
           Specification
           http://www.w3.org/TR/REC-rdf-syntax

[SCHEMA1]  XML Schema Part 1: Structures ("work in progress")
           http://www.w3.org/TR/xmlschema-1/

[Unicode]  Unicode Consortium
           http://www.unicode.org

[XDOSSIER-TRAN]
           Xdossier Transport ("work in progress")
           http://xdossier.com

[XHTML]    XHTML 1.0: The Extensible HyperText Markup Language
           ("work in progress")
           http://www.w3.org/TR/WD-html-in-xml

[XML]      Extensible Markup Language (XML) 1.0
           http://www.w3.org/TR/rec-xml

[XSL]      Extensible Stylesheet Language Specification
           ("work in progress")
           http://www.w3.org/TR/WD-xsl

19. Author

Manuel Tomas CARRASCO BENITEZ
The European Agency for the Evaluation of Medicinal Products
7 Westferry Circus
Canary Wharf
London E14 4HB
U.K.

Telephone +44 171 418 86 45
carrasco@dragoman.org
http://dragoman.org/carrasco

19.1. Disclaimer

This memo represents only the view of the author.
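
The naming rules of Section 4 and the extension convention of
Section 6 lend themselves to mechanical checking. The following
Python fragment is a minimal sketch under stated assumptions, not part
of the Xdossier definition: the function names is_name and
split_file_name are hypothetical, and the sketch assumes that a Name
contains at least one character, which this memo does not state
explicitly.

```python
import re

# A Name (Section 4): lower-case letters a-z, digits 0-9,
# HYPHEN-MINUS and LOW LINE.  Assumption: at least one character.
NAME_RE = re.compile(r"[a-z0-9_-]+")


def is_name(token):
    """Return True if the whole token is a Name."""
    return NAME_RE.fullmatch(token) is not None


def split_file_name(file_name):
    """Split a File Name into (document_name, extensions).

    The Document Name is the first Name; the file extensions are the
    second and following Names.  Returns None when any part between
    FULL STOPs is not a Name (hypothetical error convention).
    """
    parts = file_name.split(".")
    if not all(is_name(part) for part in parts):
        return None
    return parts[0], parts[1:]


if __name__ == "__main__":
    # Examples taken from Section 4.
    for candidate in ["hello.en.xml.gz", "a_part", "Myfile.html", "a part"]:
        print(candidate, "->", split_file_name(candidate))
```

With this sketch, "hello.en.xml.gz" yields the Document Name "hello"
and the extensions "en", "xml" and "gz", while "Myfile.html" and
"a part" are rejected. Deciding whether a given extension denotes a
language, a media type or an encoding is left out, since this memo
defers the syntax of file extensions to another memo.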