Internet Engineering Task Force                           E. Zierau, Ed.
Internet-Draft                                      Royal Danish Library
Intended status: Informational                              June 9, 2018
Expires: December 11, 2018

           A Persistent Web IDentifier (PWID) URN Namespace
                   draft-pwid-urn-specification-02

Abstract

   This document specifies a Uniform Resource Name (URN) for Persistent
   Web IDentifiers to web material in web archives using the 'pwid'
   namespace identifier. The purpose of the standard is to support
   general, global, sustainable, human-readable, technology-agnostic,
   persistent and precise web references for web materials in web
   archives in a way that can make them potentially resolvable.

   The PWID URN can assist in two ways. First, by providing a
   potentially resolvable, precise and persistent reference scheme for
   web archive materials, which is not sufficiently covered by existing
   web reference practices and newly suggested referencing methods.
   Second, by specifying web elements in web collections (also known as
   web corpora), even for collections that reference web elements in
   several archives.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 11, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.
Zierau                  Expires December 11, 2018               [Page 1]

Internet-Draft  A Persistent Web IDentifier (PWID) URN Namespace  June 2018

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Namespace Registration Template
   3.  Acknowledgements
   4.  References
     4.1.  Normative References
     4.2.  Informative References
   Author's Address

1.  Introduction

   The purpose of the PWID URN is to represent general, global,
   sustainable, human-readable, technology-agnostic, persistent and
   precise web archive resource references in a way that:

   o  can be used for technical solutions, e.g. to make them resolvable

   o  can cover references to all sorts of materials in web archives

   o  can cover references to materials from all sorts of web archives

   The motivation for defining a PWID namespace is the growing challenge
   of references to web resources, which the PWID as a URN can assist in
   overcoming. The standard is needed to address web materials with a
   precision and persistency on par with traditional references for
   analogue material.
   This concerns both references to web resources from research papers
   and definitions of web collections/corpora. In detail, the
   challenges are:

   o  Citation guidelines generally do not cover general and persistent
      referencing techniques for web resources that are not registered
      by Persistent Identifier systems (like DOI [DOI]). However, an
      increasing number of references point to resources that only exist
      on the web, e.g. blogs that turned out to have a historical
      impact. In order to obtain persistency for a reference, the
      target needs to be stable. As the live web is 'alive' and in
      constant change, persistency can only be obtained by referring to
      archived snapshots of the web. The PWID URN is therefore focused
      on referencing archived web material in a technology-agnostic way
      (research documented in [IPRES] and [ResawRef]).

   o  There are many new initiatives for web archive referencing. Most
      of them are centralised solutions that offer harvesting and
      referencing, but these cannot be used for existing materials in
      web archives. Other initiatives only cover open web archives, and
      thus do not cover material in closed archives; they also carry a
      risk of imprecision if resolving a reference yields a resource
      from an alternative archive. The PWID URN is needed in order to
      fill these gaps where other techniques are not sufficient.

   o  There are many different requirements for the construction of
      collection definitions for web material besides precision and
      persistency. Recent research has found that various legal and
      sustainability issues lead to a need for a collection to be
      defined by references to the web parts in the collection. The
      PWID URN is needed in such definitions in order to fulfil these
      requirements and to enable a collection to cover web materials
      from multiple archives (research documented in [ResawColl]).

   The PWID is especially useful for web material where precision is in
   focus and/or there are references to materials from closed web
   archives requiring special grants in order to gain access. The
   precision concerns both precise reference, where there can be no
   doubt that you have the correct web material, and precision about
   what the reference actually refers to (e.g. the page or the whole
   website).

   Furthermore, the PWID is very useful in the specification of the
   contents of a web collection (also known as a web corpus).
   Definitions of web collections are often needed for extraction of
   data used in the production of research results, e.g. for
   evaluations in the future. Current practices are not persistent, as
   they often rely on some CDX version, which varies between
   implementations.

   For the sake of usability and sustainability, the definition of the
   PWID URN is focused on only having the minimum required information
   to make a precise identification of a resource in an arbitrary web
   archive. Recent research has found that this is obtained with the
   following information [ResawRef]:

   o  Identification of web archive

   o  Identification of source:

      *  Archived URI or identifier

      *  Archival timestamp

   o  Intended coverage (page, part, subsite etc.)

   The PWID URN represents this information in an unambiguous way, thus
   enabling technical solutions to be based on this URN.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Namespace Registration Template

   Namespace Identifier:  PWID

   Version:  2

   Date:  2018-06-09

   Registrant:

      Eld Maj-Britt Olmuetz Zierau
      Royal Danish Library
      Soeren Kierkegaards Plads 1
      1219 Copenhagen
      Denmark
      ph: +45 9132 4690
      email: elzi@kb.dk

   Purpose:

   The purpose of the PWID URN is to represent general, global,
   sustainable, human-readable, technology-agnostic, persistent and
   precise web archive resource references in a way that:

   *  can be used for technical solutions, e.g. to make them resolvable

   *  can cover references to all sorts of materials in web archives

   *  can cover references to materials from all sorts of web archives

   The motivation for defining a PWID namespace is the growing challenge
   of references to web resources, which the PWID as a URN can assist in
   overcoming. The standard is needed to address web materials with a
   precision and persistency on par with traditional references for
   analogue material. This concerns both references to web resources
   from research papers and definitions of web collections/corpora. In
   detail, the challenges are:

   *  Citation guidelines generally do not cover general and persistent
      referencing techniques for web resources that are not registered
      by Persistent Identifier systems (like DOI [DOI]). However, an
      increasing number of references point to resources that only exist
      on the web, e.g. blogs that turned out to have a historical
      impact. In order to obtain persistency for a reference, the
      target needs to be stable. As the live web is 'alive' and in
      constant change, persistency can only be obtained by referring to
      archived snapshots of the web. The PWID URN is therefore focused
      on referencing archived web material in a technology-agnostic way
      (research documented in [IPRES] and [ResawRef]).
   *  There are many new initiatives for web archive referencing. Most
      of them are centralised solutions that offer harvesting and
      referencing, but these cannot be used for existing materials in
      web archives. Other initiatives only cover open web archives, and
      thus do not cover material in closed archives; they also carry a
      risk of imprecision if resolving a reference yields a resource
      from an alternative archive. The PWID URN is needed in order to
      fill these gaps where other techniques are not sufficient.

   *  There are many different requirements for the construction of
      collection definitions for web material besides precision and
      persistency. Recent research has found that various legal and
      sustainability issues lead to a need for a collection to be
      defined by references to the web parts in the collection. The
      PWID URN is needed in such definitions in order to fulfil these
      requirements and to enable a collection to cover web materials
      from multiple archives (research documented in [ResawColl]).

   The PWID is especially useful for web material where precision is in
   focus and/or there are references to materials from closed web
   archives requiring special grants in order to gain access. The
   precision concerns both precise reference, where there can be no
   doubt that you have the correct web material, and precision about
   what the reference actually refers to (e.g. the page or the whole
   website).

   Furthermore, the PWID is very useful in the specification of the
   contents of a web collection (also known as a web corpus).
   Definitions of web collections are often needed for extraction of
   data used in the production of research results, e.g. for
   evaluations in the future. Current practices are not persistent, as
   they often rely on some CDX version, which varies between
   implementations.

   A strict, unambiguous syntax is needed for the PWID reference in
   order to ensure that it can be used for computational purposes. This
   is relevant for web collection definitions, which need a strict
   syntax in order to be a basis for automatic extraction. Furthermore,
   readers of research papers today expect to be able to access a
   referenced resource by clicking an actionable URI. A similar
   facility will therefore be expected for references to available
   archived web material, which a strict syntax can make possible.

   Examples of technical solutions that are enabled by a strict syntax
   are:

   *  Resolving of references and automatic extraction of web
      collections defined by PWID URNs [ResawRef] [ResawColl]

   *  Resolving of a PWID reference by resolving services. As a start,
      there is work on a prototype that can work for the Danish web
      archive data and open web archives with standard patterns for the
      current technologies. Different implementations for resolving may
      appear, which may rely on different protocols and applications.

   The purpose of the PWID is also to express a web archive reference as
   simply as possible while still meeting requirements for
   sustainability, usability and scope. Therefore, the PWID URN is
   focused on only having the minimum required information to make a
   precise identification of a resource in an arbitrary web archive.
   Recent research has found that this is obtained with the following
   information [ResawRef]:

   *  Identification of web archive

   *  Identification of source:

      +  Archived URI or identifier

      +  Archival timestamp

   *  Intended coverage (page, part, subsite etc.)

   The PWID URN represents this information in an unambiguous way, thus
   enabling technical solutions to be based on this URN.

   Syntax:

   The syntax of the PWID URN is specified below in Augmented Backus-
   Naur Form (ABNF) [RFC5234] and it conforms to the URN syntax defined
   in RFC 8141 [RFC8141]. The syntax definition of the PWID URN is:

      pwid-urn = "urn" ":" pwid-NID ":" pwid-NSS
      pwid-NID = "pwid"
      pwid-NSS = archive-id ":" archival-time ":" coverage-spec
                 ":" archived-item
      archive-id = 1*( unreserved )
      archival-time = full-date datetime-delim full-pwid-time
      datetime-delim = "T"
      full-pwid-time = time-hour [":"] time-minute [":"] time-second "Z"
      coverage-spec = "part" / "page" / "subsite" / "site"
                      / "collection" / "recording" / "snapshot"
                      / "other"
      archived-item = URI / archived-item-id
      archived-item-id = 1*( unreserved )

   where

   *  'unreserved' is defined as in RFC 3986 [RFC3986]

   *  'coverage-spec' values are not case sensitive (i.e. "PAGE" /
      "PART" / "PaGe" / ... are valid values as well)

   *  'archival-time' is a UTC timestamp conforming to the W3C profile
      of ISO 8601 [ISO8601] (also defined in RFC 3339 [RFC3339]), with a
      few exceptions. It has to be a UTC timestamp in order to conform
      with web archiving practices, which always use UTC in order to
      avoid confusion.

      The 'full-date' is defined as in RFC 3339 [RFC3339].

      The 'archival-time' must represent the time specified in the
      archive, and can therefore be specified at any of the levels of
      granularity described in [W3CDTF] and in accordance with the WARC
      standard ISO 28500 [ISO28500].

      In line with RFC 3339 [RFC3339], the "T" may alternatively be
      lower case "t".

      'time-hour', 'time-minute' and 'time-second' are defined as in RFC
      3339 [RFC3339].

      In line with RFC 3339 [RFC3339], the "Z" may alternatively be
      lower case "z".

   *  'URI' is defined as in RFC 3986 [RFC3986]

   The 'coverage-spec' defines the type of archived item, serving to
   make precise what is referred to:

   *  part
      the single archived element, e.g. a pdf, an html text, an image

   *  page
      the full context as a page, e.g. an html page with referred
      images

   *  subsite
      the full context as a subsite within its domain, e.g. a document
      represented in a web structure

   *  site
      the full context as a site within its domain

   *  collection
      a collection/corpus definition, e.g. defined as described in
      [ResawColl]

   *  snapshot
      a snapshot (image) representation of web material, e.g. a web
      page

   *  recording
      a recording of a web browsing session

   *  other
      if something else

   Assignment:

   There are no authorities for assigning PWID URNs to resources.
   Instead, the name is given by the syntax and is assigned according to
   the:

   *  Identification of web archive

   *  Identification of source:

      +  Archived URI or identifier

      +  Archival timestamp

   *  Intended coverage (page, part, subsite etc.)

   Therefore, PWID URNs are created independently, but following an
   algorithm that itself guarantees uniqueness. The name will always be
   unique, as the only way a clash could arise is if a web archive
   ceased to exist and, over time, another web archive acquired the same
   name space and held resources with the same name and the same
   archiving timestamp, but with different contents. This is a highly
   unlikely scenario.

   Security and Privacy:

   Security and privacy considerations are restricted to accessible web
   resources in web archives. If resolvers for PWID URNs are created,
   an analysis should be made of whether they can be restricted to the
   aforementioned registry of web archives. Security and privacy will
   then be a question of the security and privacy considerations related
   to the web archive resources.
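
   As a rough illustration of the Assignment rule (a name is created
   independently from the archive identification, the archival
   timestamp, the coverage and the archived item), the following Python
   sketch assembles a PWID URN from those parts. It is not part of this
   specification: the function name build_pwid is hypothetical, and the
   sketch assumes the full date-time form of 'archival-time' with
   colons, a lower-case 'coverage-spec', and an 'archived-item' that is
   passed through unchanged.

```python
import re
from datetime import datetime, timezone

# Coverage values listed in the Syntax section (matched case-insensitively).
COVERAGE_SPECS = {
    "part", "page", "subsite", "site",
    "collection", "recording", "snapshot", "other",
}

# 'unreserved' characters from RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
_UNRESERVED = re.compile(r"[A-Za-z0-9._~-]+")


def build_pwid(archive_id: str, archived_at: datetime,
               coverage: str, archived_item: str) -> str:
    """Hypothetical helper: assemble a PWID URN from its four parts."""
    if not _UNRESERVED.fullmatch(archive_id):
        raise ValueError("archive-id must consist of unreserved characters")
    if coverage.lower() not in COVERAGE_SPECS:
        raise ValueError(f"unknown coverage-spec: {coverage!r}")
    if archived_at.tzinfo is None:
        raise ValueError("archival time must be timezone-aware")
    if not archived_item:
        raise ValueError("archived-item must not be empty")
    # The archival time is always expressed in UTC, ending in "Z".
    stamp = archived_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"urn:pwid:{archive_id}:{stamp}:{coverage.lower()}:{archived_item}"
```

   Because the name is derived only from its parts, two parties that
   describe the same archived item with the same values arrive at the
   same URN without consulting any assigning authority.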

   Interoperability:

   This is covered by comments in the Syntax description:

   *  the PWID URN conforms to the URI standard defined in RFC 3986
      [RFC3986] and the URN standard RFC 8141 [RFC8141]

   *  the 'archival-time' of the PWID URN conforms to the W3C profile of
      ISO 8601 [ISO8601] (also defined in RFC 3339 [RFC3339]) and to the
      WARC standard ISO 28500 [ISO28500], using UTC dates only

   *  the 'archived-item' is a URI which conforms to the URI standard
      defined in RFC 3986 [RFC3986]

   Resolution:

   The information in a PWID URN can be used for locating a web archive
   resource, for any kind of web archive. It includes the minimum
   information for web archive materials that enables resolvability,
   manually or by a resolver. The plan is to develop a resolving
   service, but this only exists in prototype form at the moment.

   Resolution of a PWID URN is the primary motivation for making a
   formal URN definition, instead of just a textual representation of
   the four needed parts of a PWID:

   *  Web archive identification
      to find the archive holding the material

   *  Archived URI or identifier of item
      as part of identifying the material

   *  Date and time associated with the archived URI/item
      as part of precise identification of the material

   *  Coverage of what is referred to
      as part of clarification of what the referred material covers
      (page, part etc.)
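
   The four parts listed above can be extracted from a PWID URN
   mechanically. The following Python sketch is an illustration only
   and is not part of this specification: the names Pwid and parse_pwid
   are hypothetical, and the pattern assumes the ABNF given under
   Syntax (full date-time form, optional colons in the time, lower-case
   "t" and "z" accepted). Note that a plain split on ":" is not
   sufficient, because both 'archival-time' and 'archived-item' may
   contain colons.

```python
import re
from typing import NamedTuple

COVERAGE_SPECS = ("part", "page", "subsite", "site",
                  "collection", "recording", "snapshot", "other")


class Pwid(NamedTuple):
    archive_id: str      # web archive identification
    archival_time: str   # UTC date and time of the archived item
    coverage_spec: str   # what the reference covers, lower-cased
    archived_item: str   # archived URI or identifier


# Approximation of the ABNF; 'archived-item' is taken as the remainder.
_PWID_RE = re.compile(
    r"urn:pwid:"
    r"(?P<archive_id>[A-Za-z0-9._~-]+):"
    r"(?P<archival_time>\d{4}-\d{2}-\d{2}[Tt]\d{2}:?\d{2}:?\d{2}[Zz]):"
    r"(?P<coverage_spec>" + "|".join(COVERAGE_SPECS) + r"):"
    r"(?P<archived_item>.+)",
    re.IGNORECASE,  # "urn", the NID and coverage-spec are case-insensitive
)


def parse_pwid(urn: str) -> Pwid:
    """Hypothetical helper: split a PWID URN into its four parts."""
    match = _PWID_RE.fullmatch(urn)
    if match is None:
        raise ValueError(f"not a PWID URN: {urn!r}")
    return Pwid(
        archive_id=match["archive_id"],
        archival_time=match["archival_time"],
        coverage_spec=match["coverage_spec"].lower(),
        archived_item=match["archived_item"],
    )
```

   A resolver, manual or automated, would start from these four values;
   how they are mapped to a location in a given archive depends on that
   archive and is outside the scope of this sketch.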

   In the following, the different resolution techniques are explained
   (manual as well as via a service).

   An example of a PWID URN is:

      urn:pwid:archive.org:2016-01-22T11:20:29Z:page:http://www.dr.dk

   It has the information:

   *  archive.org
      currently known identifier, in the form of the Internet Archive
      domain name for their open access web archive

   *  2016-01-22T11:20:29Z
      UTC date and time associated with the archived URI

   *  page
      clarification that the reference covers the full web page with all
      its inherited parts selected by the web archive

   *  http://www.dr.dk
      archived URI of item

   With knowledge of the current (2017) Internet Archive open access web
   interface having the form:

      https://web.archive.org/web/