Internet Engineering Task Force E. Zierau, Ed.
Internet-Draft Royal Danish Library
Intended status: Informational September 6, 2019
Expires: March 9, 2020

A Persistent Web IDentifier (PWID) URN Namespace
draft-pwid-urn-specification-09

Abstract

This document specifies a Uniform Resource Name (URN) for Persistent Web IDentifiers for web material in web archives using the 'pwid' namespace identifier.

The main purpose of the standard is to support references that other reference techniques do not cover, notably references to material in web archives with restricted access. Furthermore, it supports persistent, technology-agnostic references to web archives in general, in a form that can serve as an algorithmic basis for finding web archive resources. An additional important benefit is that the standard can be used to specify web collections, which can then form a persistent computational basis for extracting the archived collection parts.

The PWID URN is designed to meet the requirements for proper referencing needed by researchers. It is therefore designed to provide general, global, sustainable, human-readable, technology-agnostic, persistent and precise references to web material in web archives.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on March 9, 2020.

Copyright Notice

Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


1. Introduction

The PWID URN supplements existing reference standards by supporting references to web archives, including cases that are not supported today, such as references to material in web archives with restricted access. Furthermore, the PWID URN enables technology-agnostic references to web archives in general. Such references can be needed, for instance, for dynamic web material with frequent updates (e.g. a news site) or for a specific version of a web resource (e.g. a specific version of the DOI handbook).

The PWID URN is in a form that can serve as an algorithmic basis for finding the resource. This also makes it possible to compute the archived web parts of a collection from one or more web archives, provided the collection parts are specified by PWID URNs.

Furthermore, the PWID URN includes information about the resource which makes it possible to find alternative resources, in cases where the original precise resource has become unavailable.

The PWID URN is designed to be a persistent reference that is general, global and technology agnostic in order to enhance its chances of being sustainable. Furthermore, it is designed to be humanly readable and with an ability to specify precision about what the referenced web archive resource covers. This design enables a PWID URN to:

The motivation for defining a PWID namespace is the growing challenge of referencing archived web resources; as a URN, the PWID can help overcome many of these challenges. The standard is needed to reference web materials with precision and persistence on par with traditional references to analogue material. Furthermore, it is needed in order to address web archive resources that are not freely available online. The PWID URN covers both referencing of web resources from research papers and definition of web collections/corpora. In detail the challenges are:

The PWID URN is especially useful for web material where precision is in focus and/or where the referenced material resides in web archives that require special permissions for access. Precision has two aspects. First, the reference identifies the archive where the resource was found and validated against its purpose (archived versions in other web archives may differ in both completeness and content, even within short time periods). Second, it specifies whether the referenced resource is a web page or a part in the form of a single file.

The possibility of specifying the part/file precision enables the PWID URN to be used to specify the contents of a web collection. Definitions of web collections are often needed for extracting the data used to produce research results, e.g. for future evaluations. Current practices are not persistent, as they often rely on some CDX version, which varies between implementations.
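As an illustration of a collection definition used as a computational basis, the following minimal sketch groups the PWID URNs of a collection by archive domain, so that each archive could be asked to extract its own parts. The function name group_by_archive and the second archive domain in the usage note are hypothetical and are not defined by this document.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_archive(collection: Iterable[str]) -> Dict[str, List[str]]:
    """Group PWID URNs by their archive-domain field (illustrative helper)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for urn in collection:
        # The first three colon-separated fields are "urn", "pwid" and the
        # archive-domain; the remainder is kept untouched.
        fields = urn.split(":", 3)
        if len(fields) != 4 or [f.lower() for f in fields[:2]] != ["urn", "pwid"]:
            raise ValueError(f"not a PWID URN: {urn!r}")
        groups[fields[2]].append(urn)
    return dict(groups)
```

For example, a collection holding two URNs for archive.org and one for a hypothetical archive domain "example-archive.org" would yield a dictionary with two keys, each mapping to the URNs held by that archive.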

Strict syntax is needed for the PWID URN in order to ensure that it can act as a reference that can be used for computational purposes. This is especially relevant for automatic extraction of parts from web collection definitions. Furthermore, today's readers of research papers expect to be able to access a referenced resource by clicking an actionable URI; a similar possibility will therefore be expected for references to available archived web material, and a strict syntax makes this possible. A prototype for resolving PWID URNs has been developed for the Danish web archive data and for open web archives with standard patterns for the current technologies. Implementations for resolution of PWID URNs for other web archives may be developed.
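To illustrate how a strict syntax allows a reference to become actionable, the following is a minimal sketch that maps a PWID URN to a Wayback-style replay URL. It is not the prototype resolver mentioned above. The table REPLAY_PREFIXES, the function name to_replay_url, and the mapping of the archive domain "archive.org" to the prefix "https://web.archive.org/web/" are assumptions made for this example. The sketch ignores the precision-spec and drops fractional seconds.

```python
import re

# Assumed resolver table: archive-domain -> replay URL prefix.
# A real resolver would need one entry (or one strategy) per archive.
REPLAY_PREFIXES = {"archive.org": "https://web.archive.org/web/"}

_PWID = re.compile(r"^urn:pwid:([^:]+):([0-9T:.\-]+Z):(part|page):(.+)$")


def to_replay_url(urn: str) -> str:
    """Translate a PWID URN into a replay URL (illustrative only)."""
    m = _PWID.match(urn)
    if m is None:
        raise ValueError(f"not a PWID URN: {urn!r}")
    domain, time, _precision, uri = m.groups()
    if domain not in REPLAY_PREFIXES:
        raise LookupError(f"no replay pattern known for {domain!r}")
    # Drop fractional seconds, keep the digits only, and pad to the
    # 14-digit YYYYMMDDhhmmss form used in Wayback-style replay URLs.
    stamp = re.sub(r"\D", "", time.split(".")[0]).ljust(14, "0")
    return f"{REPLAY_PREFIXES[domain]}{stamp}/{uri}"
```

Keeping the per-archive patterns in a table reflects that the URN itself stays technology agnostic; only the resolver needs to change when an archive changes its access technology.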

The purpose of the PWID URN is also to express a web archive reference as simply as possible while meeting the requirements for sustainability, usability and scope. Therefore, the PWID URN contains only the minimum information required to precisely identify a resource in an arbitrary web archive. Recent research has shown that this can be obtained with the following information [ResawRef]:

The PWID URN represents this information in a human readable way as well as a well-defined way that enables technical solutions to interpret the URN.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. Namespace Registration Template

Namespace Identifier:

Version:

Date:

Registrant:

Purpose:

     pwid-urn = "urn:" pwid-NID ":" pwid-NSS 
               
     pwid-NID = "pwid"
     pwid-NSS = archive-domain ":" archival-time ":" precision-spec 
                           ":" archived-uri
               
     archival-time = utc-date ["T" utc-time] "Z"
     utc-date   = utc-year "-" utc-month "-" utc-day
     utc-year   = 4DIGIT
     utc-month  = 2DIGIT  ; 01-12
     utc-day    = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
                          ; month/year in UTC time
     utc-time   = utc-hour ":" utc-minute [":" utc-second [secfrac]]
     utc-hour   = 2DIGIT  ; 00-23
     utc-minute = 2DIGIT  ; 00-59
     utc-second = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second
                          ; rules
     secfrac    = "." 1*9DIGIT
               
     precision-spec = "part" / "page" 
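The grammar above can be checked mechanically. The following is a minimal, non-normative sketch of a parser for it. The names Pwid and parse_pwid are hypothetical. Because the rules for archive-domain and archived-uri are not shown above, the sketch assumes that archive-domain contains no colon and that archived-uri is any non-empty remainder. Literals are matched case-insensitively, as ABNF string literals are case-insensitive per [RFC5234].

```python
import re
from datetime import datetime
from typing import NamedTuple


class Pwid(NamedTuple):
    archive_domain: str
    archival_time: str
    precision_spec: str
    archived_uri: str


# archival-time = utc-date ["T" utc-time] "Z"; seconds and the fraction
# (1 to 9 digits) are optional, as in the ABNF.
_PWID_RE = re.compile(
    r"^urn:pwid:"
    r"(?P<domain>[^:]+):"  # assumption: no colon in archive-domain
    r"(?P<time>\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}(?::\d{2}(?:\.\d{1,9})?)?)?Z):"
    r"(?P<precision>part|page):"
    r"(?P<uri>.+)$",  # assumption: any non-empty remainder
    re.IGNORECASE,
)


def parse_pwid(urn: str) -> Pwid:
    """Split a PWID URN into its four fields, or raise ValueError."""
    m = _PWID_RE.match(urn)
    if m is None:
        raise ValueError(f"not a syntactically valid PWID URN: {urn!r}")
    time = m.group("time").upper()
    date_part, _, clock = time[:-1].partition("T")
    # Calendar check (month 01-12, day valid for month/year);
    # strptime raises ValueError for dates such as 2019-02-30.
    datetime.strptime(date_part, "%Y-%m-%d")
    if clock:
        fields = clock.split(".")[0].split(":")
        hour, minute = int(fields[0]), int(fields[1])
        second = int(fields[2]) if len(fields) > 2 else 0
        # 60 is accepted for utc-second to allow for leap seconds.
        if hour > 23 or minute > 59 or second > 60:
            raise ValueError(f"time of day out of range: {time!r}")
    return Pwid(m.group("domain"), time,
                m.group("precision").lower(), m.group("uri"))
```

Applied to "urn:pwid:archive.org:2018-11-01T15:26:28Z:page:http://mementoweb.org/about/", the sketch yields the archive domain "archive.org", the archival time "2018-11-01T15:26:28Z", the precision "page" and the archived URI "http://mementoweb.org/about/". A value without the trailing "Z" in the archival time is rejected.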
               

Syntax:

     "urn:pwid:" archive-domain ":" archival-time ":" precision-spec
                            ":" archived-uri 
             
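As a complement to the syntax above, the following minimal sketch composes a PWID URN from its four components. The function name make_pwid is hypothetical. The sketch assumes that the caller supplies a timezone-aware timestamp, which is converted to UTC and written with whole seconds. It does not validate archive-domain or archived-uri.

```python
from datetime import datetime, timezone


def make_pwid(archive_domain: str, archived_at: datetime,
              precision_spec: str, archived_uri: str) -> str:
    """Compose a PWID URN string from its components (illustrative only)."""
    if precision_spec not in ("part", "page"):
        raise ValueError("precision_spec must be 'part' or 'page'")
    if archived_at.tzinfo is None:
        # A naive timestamp cannot be converted to UTC reliably.
        raise ValueError("archived_at must be timezone-aware")
    stamp = archived_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"urn:pwid:{archive_domain}:{stamp}:{precision_spec}:{archived_uri}"
```

Requiring a timezone-aware input is a design choice of this sketch: it avoids silently writing a local time into a field that the syntax defines as UTC.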

Assignment:

Security and Privacy:

Interoperability:

Resolution:

Documentation:

Additional Information:

Revision Information:

3. Acknowledgements

Special thanks to Caroline Nyvang and Thomas Kromann, who contributed to the research identifying the minimum information required in a persistent web reference, and to Bolette Jurik, who contributed supplementary research concerning requirements for web collection/corpora definitions. Thanks also to everybody who has contributed to this work, both with the research and with reviewing this RFC.

4. References

4.1. Normative References

[RFC1034] Mockapetris, P., "Domain names - concepts and facilities", STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet: Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002.
[RFC3986] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, DOI 10.17487/RFC3986, January 2005.
[RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, DOI 10.17487/RFC5234, January 2008.
[RFC8141] Saint-Andre, P. and J. Klensin, "Uniform Resource Names (URNs)", RFC 8141, DOI 10.17487/RFC8141, April 2017.

4.2. Informative References

[DOI] International DOI Foundation, "The DOI System", 2016.

urn:pwid:archive.org:2016-10-20T22:26:35Z:page:https://www.doi.org/

[DraftPwidUri] Zierau, E., "DRAFT: Scheme Specification for the pwid URI, Version 4", June 2018.
[IDCC2019] Zierau, E., "Web References Meeting Requirements for Proper Referencing Principles", February 2019.

Poster at 14th International Digital Curation Conference (iDCC) 2019

[IPRES2016] Zierau, E., Nyvang, C. and T. Kromann, "Persistent Web References - Best Practices and New Suggestions", October 2016.

In: proceedings of the 13th International Conference on Preservation of Digital Objects (iPres) 2016, pp. 237-246

[IPRES2018] Zierau, E., "Precise and Persistent Web Archive References - Status, context and expected progress of the PWID", September 2018.

In: proceedings of the 15th International Conference on Preservation of Digital Objects (iPres) 2018, DOI: 10.17605/OSF.IO/U5W3Q

[ISO28500] International Organization for Standardization, "Information and documentation -- WARC file format", 2017.
[ISO8601] International Organization for Standardization, "Data elements and interchange formats -- Information interchange -- Representation of dates and times", 2004.
[MEMENTO] Memento Development Group, "About the Memento Project", January 2015.

urn:pwid:archive.org:2018-11-01T15:26:28Z:page:http://mementoweb.org/about/

[PWIDprovider] Royal Danish Library (Netarkivet), "SolrWayback 3.1", 2018.

urn:pwid:archive.org:2018-06-11T02:00:05Z:page:https://github.com/netarchivesuite/solrwayback

[PWIDresolver] Royal Danish Library (Netarkivet), "NAS-research version 0.0.6", 2018.

urn:pwid:archive.org:2018-07-16T06:53:51Z:page:https://github.com/netarchivesuite/NAS-research/releases/tag/0.0.6

[ResawColl] Jurik, B. and E. Zierau, "Data Management of Web archive Research Data", 2017.

In: proceedings of the RESAW 2017 Conference, DOI: 10.14296/resaw.0002

[ResawRef] Nyvang, C., Kromann, T. and E. Zierau, "Capturing the Web at Large - a Critique of Current Web Referencing Practices", 2017.

In: proceedings of the RESAW 2017 Conference, DOI: 10.14296/resaw.0004

[W3CDTF] W3C, "Date and Time Formats: note submitted to the W3C. 15 September 1997", 1997.

urn:pwid:archive.org:2017-04-03T03:37:42Z:page:http://www.w3.org/TR/NOTE-datetime

Author's Address

Eld Maj-Britt Olmuetz Zierau (editor)
Royal Danish Library
Soeren Kierkegaards Plads 1
Copenhagen, 1219
Denmark

Phone: +45 9132 4690
EMail: elzi@kb.dk