Internet Engineering Task Force E. Zierau, Ed.
Internet-Draft Royal Danish Library
Intended status: Informational June 9, 2018
Expires: December 11, 2018

Scheme Specification for the pwid URI
draft-pwid-uri-specification-04

Abstract

This document specifies a Uniform Resource Identifier (URI) for Persistent Web IDentifiers to web material in web archives using the 'pwid' scheme name. The purpose of the standard is to support general, global, sustainable, humanly readable, technology agnostic, persistent and precise web references for web materials in web archives in a way that can make them potentially resolvable.

The PWID URI can assist in two ways. First, it provides a potentially resolvable, precise and persistent reference scheme for web archive materials, which is not sufficiently covered by existing web reference practices or newly suggested referencing methods. Second, it can specify web elements in web collections (also known as web corpora), even for collections that reference web elements in several archives.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 11, 2018.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


1. Introduction

The purpose of the PWID URI is to represent general, global, sustainable, humanly readable, technology agnostic, persistent and precise web archive resource references in a way that:

The motivation for defining a PWID URI scheme is the growing challenge of referencing web resources, which the PWID as a URI can help overcome. The standard is needed so that references to web materials can reach a precision and persistency on par with traditional references for analogue material. This concerns both referencing of web resources from research papers and definition of web collections/corpora. In detail the challenges are:

The PWID is especially useful for web material where precision is in focus and/or where there are references to materials from closed web archives requiring special grants in order to gain access. The precision concerns both precise reference, where there can be no doubt that you have the correct web material, and precision about what is actually referred to by the reference (e.g. whether it is the page or the whole website).

Furthermore, the PWID is very useful for specifying the contents of a web collection (also known as a web corpus). Definitions of web collections are often needed for extraction of data used in production of research results, e.g. for evaluations in the future. Current practices are not persistent, as they often use some CDX version, which varies between implementations.

For the sake of usability and sustainability, the definition of the PWID URI scheme is focused on only having the minimum required information to make a precise identification of a resource in an arbitrary web archive. Recent research has found that this is obtained by the following information [ResawRef]:

The PWID URI scheme represents this information in an unambiguous way, thus enabling technical solutions to be defined based on this scheme.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

2. Demonstrable, New, Long-Lived Utility

The purpose of the PWID URI is to represent the needed referencing information (as listed in the introduction) in a scheme that can be used for technical solutions. As described in [ResawColl], such references can be represented in a textual way. However, a strict, unambiguous syntax is needed in order to ensure that it can be used for computational purposes. This is relevant for web collection definitions, which will need a strict scheme in order to be a basis for automatic extraction. Furthermore, readers of research papers today expect to be able to access a referenced resource by clicking an actionable URI; a similar facility will therefore be expected for references to available archived web material.

Interest in this new PWID URI scheme has already been shown: a paper about the invention of the PWID URI, "Persistent Web References - Best Practices and New Suggestions" [IPRES], was accepted for the iPres 2016 conference and nominated as best paper. At the RESAW 2017 conference there were two related papers: one on referencing practices [ResawRef] and one on research data management practices [ResawColl]. The interest in the PWID URI so far indicates that this is a recognized issue, and that the PWID URI can fill a gap.

The PWID URI could function as a URN [RFC8141], and will do so as a starting point (a proposal was sent in December 2017, with updates in June 2018). The ambition is to make an easily understandable and technology independent persistent identifier, for which the "urn:" prefix would be disturbing. Therefore it is also suggested as a URI, as in time there will come a way for it to function as a URI and also enjoy the same common syntactic, semantic, and shared language benefits that the URI presentation confers.

It should be noted that for closed web archives, the PWID URI can be used to resolve within a closed environment. Likewise, the PWID can be resolved within coming web archive research infrastructure, which is currently being proposed in the RESAW community [RESAW].

3. Syntactic Compatibility

The syntax of the PWID URI Scheme is specified below in Augmented Backus-Naur Form (ABNF) [RFC5234] and it conforms to URI syntax defined in [RFC3986]. The syntax definition of the PWID URI is:

  pwid-uri        = pwid-scheme ":" pwid-spec

  pwid-scheme     = "pwid"
  pwid-spec       = archive-id ":" archival-time ":" coverage-spec
                    ":" archived-item

  archive-id      = 1*( unreserved )

  archival-time   = full-date datetime-delim full-pwid-time
  datetime-delim  = "T"
  full-pwid-time  = time-hour ["."] time-minute ["."] time-second "Z"

  coverage-spec   = "part" / "page" / "subsite" / "site"
                    / "collection" / "recording" / "snapshot"
                    / "other"

  archived-item     = URI / archived-item-id
  archived-item-id  = 1*( unreserved )


where

The 'coverage-spec' defines the type of archived item, making precise what is referred to:

Note that the 'coverage-spec' is a parameter that could have been specified as a query. However, since the 'pwid-uri' can include a URI as 'archived-item', it would introduce ambiguities if the 'coverage-spec' were specified as a query, since it would not be clear whether the query belonged to the 'pwid-uri' or the 'archived-item'.
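
As an illustration only, the grammar above can be approximated by a regular expression. The following Python sketch is not part of the specification. The names PWID_RE, Pwid and parse_pwid are hypothetical, 'unreserved' is taken from [RFC3986], 'full-date' is assumed to be the YYYY-MM-DD form of [RFC3339], and the 'archived-item' is accepted as any non-empty string rather than validated as a URI. Literals are matched in the case used in this document's examples.

```python
import re
from typing import NamedTuple

# Keywords copied from the 'coverage-spec' rule.
COVERAGE = ("part", "page", "subsite", "site", "collection",
            "recording", "snapshot", "other")

# One or more 'unreserved' characters (RFC 3986):
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = r"[A-Za-z0-9\-._~]+"

PWID_RE = re.compile(
    r"^pwid:"
    r"(?P<archive_id>" + UNRESERVED + r"):"
    # full-date "T" hour ["."] minute ["."] second "Z"
    r"(?P<time>\d{4}-\d{2}-\d{2}T\d{2}\.?\d{2}\.?\d{2}Z):"
    r"(?P<coverage>" + "|".join(COVERAGE) + r"):"
    # Simplification: any non-empty remainder is taken as the item.
    r"(?P<item>.+)$"
)


class Pwid(NamedTuple):
    archive_id: str
    archival_time: str
    coverage: str
    archived_item: str


def parse_pwid(text: str) -> Pwid:
    """Split a PWID URI into its four parts or raise ValueError."""
    match = PWID_RE.match(text)
    if match is None:
        raise ValueError(f"not a PWID URI: {text!r}")
    return Pwid(match["archive_id"], match["time"],
                match["coverage"], match["item"])
```

Because the 'archival-time' contains no ":" characters, the first four colons always delimit the fixed parts, and any colons inside the 'archived-item' (as in "http://...") remain part of the item.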

4. Well Defined

The information in a PWID URI can be used for locating a web archive resource, for any kind of web archive. It includes the minimum information for web archive materials, which enables resolvability, manually or by a resolver. One of the reasons for defining PWID as a URI is to enable a general, technology agnostic, persistent representation to be resolvable at any time.

The information needed is:

For example the PWID URI:

has the information:

With knowledge of the current (2017) Internet Archive open access web interface having the form:

We can manually (or technically) deduce an actual (current 2017) access https address:

and regard the referred web page as the reference.

The same recipe can be used for other Wayback platforms, and possibly also for other web archive access platforms, as the crucial information is the date and the URI, which are to be looked up in a specified archive.

Note that this also includes access to archives that are only accessible via a local proxy to a restricted environment. Here the difference is that the archive information is used to identify the local environment used (possibly on-site) and then to construct a local http/https address based on knowledge of the local access installation. In November a prototype for resolving PWIDs to Netarkivet was created, and there are plans to extend it.
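
The recipe can be sketched in Python as follows. This is a hedged illustration, not a defined resolver: the function name pwid_to_wayback_url is hypothetical, and the address template "<base>/<14-digit timestamp>/<URI>" with the default base "https://web.archive.org/web" is an assumption about Wayback-style access interfaces, which may change over time. The 'coverage-spec' is ignored in this sketch.

```python
def pwid_to_wayback_url(pwid: str,
                        base: str = "https://web.archive.org/web") -> str:
    """Build a Wayback-style access address from a PWID URI.

    'base' is an assumed access prefix for the archive in question;
    another archive would need its own prefix.
    """
    parts = pwid.split(":", 4)  # the item itself may contain ":"
    if len(parts) != 5 or parts[0] != "pwid":
        raise ValueError(f"not a PWID URI: {pwid!r}")
    _scheme, _archive_id, archival_time, _coverage, item = parts

    # "2017-04-03T03.37.42Z" -> "20170403033742"
    timestamp = "".join(ch for ch in archival_time if ch.isdigit())
    if len(timestamp) != 14:
        raise ValueError(f"unexpected archival time: {archival_time!r}")

    return f"{base}/{timestamp}/{item}"


print(pwid_to_wayback_url(
    "pwid:archive.org:2017-04-03T03.37.42Z:page:"
    "http://www.w3.org/TR/NOTE-datetime"))
```

For a closed archive, the same function could be called with a 'base' pointing at the local access installation.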

5. Definition of Operations

The PWID URI Scheme is another step in facilitating, supporting, and standardizing persistent web references to resources in web archives. There is not yet a specific definition of computational operations. It is expected that there may be different implementations, in step with needed use and with available technology and infrastructures.

Automatic access to a referenced web resource may work on the open net for open web archives or in restricted environments for closed web archives. There may be a need for varied operation depending on the available technology and applications, e.g.:

Use of URIs for standard web archive interfaces is preferred as dependency on registries and infrastructures may pose too many limits.

6. Context of Use

The PWID URI scheme facilitates, supports and standardises the identification of web archive resources in a general, global, sustainable, humanly readable, technology agnostic, persistent and precise way. The standard is needed so that references to web materials can reach a precision and persistency on par with traditional references for analogue material.

The purpose of the PWID URI is to represent this information in a scheme that can be used for technical solutions, for example for resolving references and for automatic extraction of a web collection defined by PWID URIs [ResawRef] [ResawColl]. As described above, different implementations for resolving may emerge, which may rely on different protocols and applications.
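
As a minimal sketch of how a collection definition could be prepared for extraction, the following Python function groups a list of PWID URIs by archive, so that each archive can be queried separately. The function name group_by_archive is hypothetical. The archive identifier "netarkivet.dk" and the URI "http://example.org/" in the usage example are invented for illustration and are not registered values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_archive(pwids: Iterable[str]) -> Dict[str, List[str]]:
    """Group the PWID URIs of a collection by their archive identifier."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for pwid in pwids:
        parts = pwid.split(":", 4)  # the item itself may contain ":"
        if len(parts) != 5 or parts[0] != "pwid":
            raise ValueError(f"not a PWID URI: {pwid!r}")
        # The archive identifier is compared without regard to case.
        groups[parts[1].lower()].append(pwid)
    return dict(groups)


collection = [
    "pwid:archive.org:2017-05-29T11.31.50Z:site:http://resaw.eu/",
    "pwid:archive.org:2017-04-03T03.37.42Z:page:"
    "http://www.w3.org/TR/NOTE-datetime",
    # Hypothetical archive identifier and item, for illustration only.
    "pwid:netarkivet.dk:2017-01-01T00.00.00Z:page:http://example.org/",
]
for archive, members in group_by_archive(collection).items():
    print(archive, len(members))
```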

7. Internationalization and Character Encoding

Internationalization and character encoding for PWID URIs are relevant for the 'archive-id' and 'archived-item' syntactical units of the scheme-specific-part of the PWID URI. The rest of the main syntactical units ('archival-time' and 'coverage-spec') are constructed from a very limited set of characters, and therefore do not need internationalization and character encoding.

The 'archive-id' will not be case sensitive, but can allow for percent encodings, although, for reasons of simplicity, the coming archive registry may turn out to recommend using letters that do not need encoding.

The 'archived-item' follows the rules of URIs in general (currently for http and https URIs archived in web archives). The 'archived-item' is only case sensitive to the extent that the web archive can handle archived case sensitive URIs.
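
The comparison rules above could be applied as in the following Python sketch, which brings two PWID URIs to a common form before comparing them. The function name normalize_pwid is hypothetical. Two points are assumptions of this sketch rather than statements of the specification: that time values written with and without the optional "." separators denote the same instant, and that the 'archived-item' is compared exactly as written, since its case handling depends on the archive.

```python
def normalize_pwid(pwid: str) -> str:
    """Return a comparison form of a PWID URI.

    The archive identifier is lower-cased and the optional "."
    separators are dropped from the time; the archived item is
    left untouched.
    """
    parts = pwid.split(":", 4)  # the item itself may contain ":"
    if len(parts) != 5 or parts[0].lower() != "pwid":
        raise ValueError(f"not a PWID URI: {pwid!r}")
    _scheme, archive_id, archival_time, coverage, item = parts

    date, _sep, clock = archival_time.partition("T")
    clock = clock.replace(".", "")  # "11.31.50Z" -> "113150Z"

    return ":".join(
        ["pwid", archive_id.lower(), f"{date}T{clock}", coverage, item])


a = normalize_pwid(
    "pwid:Archive.ORG:2017-05-29T11.31.50Z:site:http://resaw.eu/")
b = normalize_pwid(
    "pwid:archive.org:2017-05-29T113150Z:site:http://resaw.eu/")
print(a == b)
```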

8. Scheme Name Considerations

The scheme name is "pwid", short for Persistent Web Identifier. Initially, the scheme name "wpid" was reserved. However, some of the feedback raised a concern that "wpid" would be interpreted as a PID related to a PID system, e.g. the DOI. Although PID does not have a precise definition that makes it wrong to call it a "wpid", the danger is that it is confused with PID systems, which is not the intention. Consequently, this suggestion names the scheme "pwid" instead.

9. Interoperability Considerations

This is covered by the comments on the date in the section on Syntactic Compatibility, where the 'archival-time' conforms to the W3C profile of ISO 8601, except for a minor modification in order to make it fit into a URI. Furthermore, the 'archived-item' conforms to the URI standard.
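
To illustrate the relation to common date and time formats, the following Python sketch converts an 'archival-time' value into a timezone-aware datetime object and prints it in the colon-separated ISO 8601 form. The function name parse_archival_time is hypothetical, and the trailing "Z" is assumed to denote UTC, as in ISO 8601.

```python
from datetime import datetime, timezone


def parse_archival_time(value: str) -> datetime:
    """Convert e.g. "2017-04-03T03.37.42Z" to an aware UTC datetime."""
    # The "." separators are optional in the grammar, so drop them
    # before parsing; the date part contains no "." characters.
    compact = value.replace(".", "")
    parsed = datetime.strptime(compact, "%Y-%m-%dT%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


moment = parse_archival_time("2017-04-03T03.37.42Z")
print(moment.isoformat())  # 2017-04-03T03:37:42+00:00
```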

10. Acknowledgements

Special thanks to Caroline Nyvang and Thomas Kromann, who have contributed to the research identifying the minimum information required in a persistent web reference, and to Bolette Jurik, who contributed supplementary research concerning requirements for web collection/corpora definitions. Thanks also to all who have contributed to this work through research and by reviewing this RFC.

11. IANA Considerations

The URI scheme name 'pwid' is reserved as a provisional URI scheme as a result of IANA request #938449.

12. Clear Security and Privacy Considerations

Security and privacy considerations are restricted to accessible web resources in web archives. If resolvers for PWID URIs are created, an analysis should be made of whether they can be restricted to the aforementioned registry of web archives. Security and privacy will then be a question of the security and privacy considerations related to the web archive resources.

13. References

13.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet: Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002.
[RFC3986] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, DOI 10.17487/RFC3986, January 2005.
[RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, DOI 10.17487/RFC5234, January 2008.

13.2. Informative References

[DOI] International DOI Foundation, "The DOI System", 2016.

pwid:archive.org:2016-10-20T22.26.35Z:site:https://www.doi.org/

[IPRES] Zierau, E., Nyvang, C. and T. Kromann, "Persistent Web References - Best Practices and New Suggestions", October 2016.

In: proceedings of the 13th International Conference on Preservation of Digital Objects (iPres) 2016, pp. 237-246

[ISO28500] International Organization for Standardization, "Information and documentation -- WARC file format", 2017.
[ISO8601] International Organization for Standardization, "Data elements and interchange formats -- Information interchange -- Representation of dates and times", 2004.
[RESAW] The Resaw Community, "A Research infrastructure for the Study of Archived Web materials", 2017.

pwid:archive.org:2017-05-29T11.31.50Z:site:http://resaw.eu/

[ResawColl] Jurik, B. and E. Zierau, "Data Management of Web archive Research Data", 2017.

In: proceedings of the RESAW 2017 Conference, DOI: 10.14296/resaw.0002

[ResawRef] Nyvang, C., Kromann, T. and E. Zierau, "Capturing the Web at Large - a Critique of Current Web Referencing Practices", 2017.

In: proceedings of the RESAW 2017 Conference, DOI: 10.14296/resaw.0004

[RFC6068] Duerst, M., Masinter, L. and J. Zawinski, "The 'mailto' URI Scheme", RFC 6068, DOI 10.17487/RFC6068, October 2010.
[RFC8141] Saint-Andre, P. and J. Klensin, "Uniform Resource Names (URNs)", RFC 8141, DOI 10.17487/RFC8141, April 2017.
[W3CDTF] W3C, "Date and Time Formats: note submitted to the W3C. 15 September 1997", 1997.

W3C profile of ISO 8601 pwid:archive.org:2017-04-03T03.37.42Z:page:http://www.w3.org/TR/NOTE-datetime

Author's Address

Eld Maj-Britt Olmuetz Zierau (editor)
Royal Danish Library
Soeren Kierkegaards Plads 1
Copenhagen, 1219
Denmark

Phone: +45 9132 4690
EMail: elzi@kb.dk