The Archive and Packaging Pointer system

The University of Manchester
Oxford Road
Manchester
United Kingdom
stain@apache.org

Mozilla Corporation
marcos@marcosc.com

General
Internet-Draft

This Internet-Draft proposes the
Archive and Packaging Pointer system with the URI
scheme app.

app URIs can be used to consume or reference hypermedia
resources bundled inside a file archive or a mobile
application package, as well as to resolve URIs for
archive resources within a programmatic framework.

This URI scheme provides mechanisms to generate a
unique base URI to represent the root of the archive,
so that relative URI references in a bundled resource
can be resolved within the archive without having to
extract the archive content on the local file system.

An app URI can be used for purposes of isolation
(e.g. when consuming multiple archives),
security constraints (avoiding “climb out” from the archive),
or for externally identifying sub-resources in
other hypermedia formats.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL
NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and
“OPTIONAL” in this document are to be interpreted as described in
.

Applications that are accessing resources bundled inside a
file archive (e.g. zip or tar.gz) can struggle to consume
hypermedia content types that use relative
URI references , as it is challenging to
determine the base URI in a consistent fashion.

Frequently the archive must be unpacked locally to
synthesize base URIs like file:///tmp/a1b27ae03865/
to represent the root of the archive. Such URIs are transient,
might not be globally unique, and could be vulnerable to
attacks such as “climbing out” of the root directory.

Mobile and Web applications that are distributed as packages
may bundle resources such as stylesheets with
relative URI references to images and fonts.

An archive containing multiple HTML or
Linked Data resources, such as in a
BagIt archive , may be using
relative URIs to cross-reference constituent files.

Consumption of archives might be performed
in memory or through a common framework, abstracting
away any local file location.

Consumption of an archive with a consistent base URI
should be possible no matter from which location it was retrieved,
or on which device it is inspected.

When consuming multiple archives from untrusted sources
it would be beneficial to have a Same Origin policy
so that relative hyperlinks can’t escape the particular archive.

The file: URI scheme
can be ill-suited for purposes such as above, where a
location-independent URI scheme is more flexible,
secure and globally unique.

The app URI scheme follows the syntax for hierarchical
URIs according to the following production:

* The app-authority component provides a unique identifier for the opened archive.
See for details.

* The path-absolute component provides the absolute path of a resource
(e.g. a file or directory) within the archive. See
for details.

* The semantics of the query component are undefined by this Internet-Draft.
Implementations SHOULD NOT generate a query component for app URIs.

* The “fragment” component MAY be used by implementations according
to and the implied media type of the
resource at the path. This Internet-Draft does not specify how
to determine the media type.

The purpose of the authority component in an app URI is
to build a unique base URI for a particular archive. The
authority is NOT intended to be resolvable without prior
knowledge of the archive.

The authority of an app URI MUST be valid according to
this production:

1. The UUID production matches its definition in , e.g.
2a47c495-ac70-4ed1-850b-8800a57618cf

2. The alg-val production matches its definition in , e.g.
sha-256;JCS7yveugE3UaZiHCs1XpRVfSHaewxAKka0o5q2osg8

3. The authority production matches its definition in , e.g. example.com.
As this production necessarily also matches the UUID and alg-val
productions, consumers of app URIs should attempt to match those first.
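
As a minimal sketch of this matching order, a consumer might classify an authority as shown below. It assumes the usual 8-4-4-4-12 hexadecimal UUID syntax and an alg-val of the form alg ";" val, where both parts consist of URI "unreserved" characters. The function name classify_authority, its return labels and the regular expressions are illustrative only and are not defined by this Internet-Draft.

```python
import re

# UUID: 8-4-4-4-12 hexadecimal digits (accepted case-insensitively).
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
# alg-val: alg ";" val, each one or more URI "unreserved" characters.
ALG_VAL_RE = re.compile(r"[A-Za-z0-9._~-]+;[A-Za-z0-9._~-]+")


def classify_authority(authority: str) -> str:
    """Return "uuid", "alg-val" or "authority" (hypothetical labels)."""
    # The generic authority production also matches the other two forms,
    # so the more specific productions are tried first.
    if UUID_RE.fullmatch(authority):
        return "uuid"
    if ALG_VAL_RE.fullmatch(authority):
        return "alg-val"
    return "authority"


for example in (
    "2a47c495-ac70-4ed1-850b-8800a57618cf",
    "sha-256;JCS7yveugE3UaZiHCs1XpRVfSHaewxAKka0o5q2osg8",
    "example.com",
):
    print(example, "->", classify_authority(example))
```

The sketch does not validate the generic authority itself; anything that is neither a UUID nor an alg-val simply falls through to the last branch.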
While section 2.2 says an extension may not
“define the structure or the semantics for URI authorities”,
extensions of this Internet-Draft are permitted to do so,
if using a DNS domain name under their control.
For instance, a vendor owning example.com may specify that
{OID} in {OID}.oid.example.com has special semantics.

The choice of authority depends on the purpose of the app URI within the implementation.
Below are some recommendations:

* Sandboxing: when independently interpreting resources in
an archive, the authority SHOULD be a
UUID v4 created with a suitable random number generator .
This ensures with high probability that
the app base URI is globally unique. An application MAY choose to
reuse a previously assigned UUID that is associated with the archive.

* Location-based: for referencing resources in an archive accessed at a
particular URL, the authority SHOULD be generated as a name-based UUID v5 ; that is,
based on the SHA-1 hash of the concatenation of the URL namespace
6ba7b811-9dad-11d1-80b4-00c04fd430c8 (as UUID bytes) and the
ASCII bytes of the particular URL. It is NOT RECOMMENDED to use this approach
with a file URI without a fully qualified host name.

* Hash-based: for referencing resources in an archive as a
particular bytestream, independent of its location, the authority SHOULD be
a checksum of the archive bytes. The checksum MUST be expressed
according to ’s alg-val production, and SHOULD use the
sha-256 algorithm. It is NOT RECOMMENDED to use truncated hash methods.

* The generic authority production MAY be used
for extensions if the above mechanisms are not suitable.
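
The sandboxing, location-based and hash-based recommendations could be sketched with the Python standard library as follows. This is an illustration under stated assumptions rather than a normative algorithm: the function names are hypothetical, and the val part of alg-val is assumed to be the base64url encoding of the raw digest with the "=" padding removed.

```python
import base64
import hashlib
import uuid


def sandbox_authority() -> str:
    # UUID v4; CPython's uuid4() draws its bits from os.urandom().
    return str(uuid.uuid4())


def location_authority(url: str) -> str:
    # Name-based UUID v5 in the URL namespace
    # (uuid.NAMESPACE_URL is 6ba7b811-9dad-11d1-80b4-00c04fd430c8).
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))


def alg_val(digest: bytes, alg: str = "sha-256") -> str:
    # base64url of the raw digest bytes, without "=" padding (assumption).
    val = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{alg};{val}"


def hash_authority(archive_bytes: bytes) -> str:
    # Untruncated SHA-256 over the complete archive bytestream.
    return alg_val(hashlib.sha256(archive_bytes).digest())


print("app://" + sandbox_authority() + "/")
print("app://" + location_authority("http://example.com/data.zip") + "/")
print("app://" + hash_authority(b"example archive bytes") + "/")
```

The location-based and hash-based authorities are deterministic, so two implementations that start from the same URL or the same bytes arrive at the same base URI, whereas each call to sandbox_authority() yields a new one.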
When the generic authority production is used, care should be taken that
the custom authority does not match the UUID or alg-val productions.

The path-absolute component MUST match the production in
and provide the absolute path of a resource
(e.g. a file or directory) within the archive.

Archive media types vary in constraints and flexibilities
of how to express paths. Here we assume an archive generally
consists of a single root directory, which can contain
multiple directories and files at arbitrary nesting levels.

Paths SHOULD be expressed using / as the directory separator.
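
As an illustration, an archive member name could be mapped to a path-absolute roughly as sketched below. The sketch assumes member names that already use / as separator (as in zip and tar), percent-encodes every character outside the URI "unreserved" set, and appends a trailing / for directories. The function name to_path_absolute and the decision to reject ".." segments are choices of this sketch, not requirements of this Internet-Draft.

```python
from urllib.parse import quote


def to_path_absolute(member_name: str, is_directory: bool = False) -> str:
    """Map a member name such as "./css/base.css" to a path-absolute."""
    segments = []
    for segment in member_name.split("/"):
        if segment in ("", "."):
            continue  # skip empty segments and "./" prefixes
        if segment == "..":
            raise ValueError("member name must not climb out of the archive")
        # Percent-encode (as UTF-8) everything except unreserved characters.
        segments.append(quote(segment, safe=""))
    path = "/" + "/".join(segments)
    # A trailing "/" marks a directory; the root directory is already "/".
    if is_directory and segments:
        path += "/"
    return path


print(to_path_absolute("./css/base.css"))
print(to_path_absolute("pics", is_directory=True))
print(to_path_absolute("my docs/a b.txt"))
```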
The below productions are from :

In an app URI, each intermediate segment (or segment-nz)
represents a directory name, while the last segment represents
either a directory or file name.

It is RECOMMENDED to include the trailing / if it is known
that the path represents a directory.

This Internet-Draft does not constrain what particular format
might constitute an archive, and neither does it require
that the archive is retrievable as a single bytestream or file.
Examples of archive media types include
application/zip, application/vnd.android.package-archive,
application/x-tar, application/x-gtar and
application/x-7z-compressed.

The authority component identifies the archive file.

The path component of an app URI identifies individual
resources within a particular archive, typically
a directory or file.

* If the path is missing or empty (e.g.
app://833ebda2-f9a8-4462-b74a-4fcdc1a02d22), then
the app URI represents the whole archive file.

* If the path is / (e.g.
app://833ebda2-f9a8-4462-b74a-4fcdc1a02d22/),
then the app URI represents the root directory
of the archive.

* If the path ends with / then the path represents
a directory in the archive.

The app URIs can be used for uniquely identifying
the resources independent of the location of the archive,
such as within an information system.

Assuming an appropriate resolution mechanism which has
knowledge of the corresponding archive, an app URI
can also be used for resolution.

This Internet-Draft does not directly specify the protocol to
resolve resources according to the app URI scheme.
For instance, one implementation might rewrite app URIs to
localized file:/// paths in a temporary directory, while
another implementation might use an embedded HTTP server.

It is envisioned that an implementation will
have extracted or opened an archive in
advance, and assigned it an appropriate authority according
to . Such an implementation
can then resolve app URIs programmatically, e.g. by using
in-memory access or mapping paths to the extracted archive on
the local file system.

Implementations that support resolving app URIs SHOULD:

* Fail with the equivalent of Not Found if the authority is unknown.
* Fail with the equivalent of Gone if the authority is known, but the content of the archive is no longer available.
* Fail with the equivalent of Not Found if the path does not map to a file or directory within the archive.
* Return the corresponding (potentially uncompressed) bytestream if the path maps to a file within the archive.
* Return an appropriate directory listing if the path maps to a directory within the archive.
* Return an appropriate directory listing of the archive’s root directory if the path is /
* Return the archive file if the path component is missing/empty.

Not all archive formats or implementations will have the
concept of a directory listing, in which case a request for a directory listing
SHOULD fail with the equivalent of “Not Implemented”.

It is not specified in this Internet-Draft how an implementation
can determine the media type of a file within an archive. This may
be expressed in secondary resources (such as a manifest),
or be determined by file extensions or magic bytes.

The media type text/uri-list MAY be used to represent
a directory listing, in which case it SHOULD contain only URIs
that start with the app URI of the directory.

Some archive formats might support resources which are
neither directories nor regular files (e.g. device files,
symbolic links). This Internet-Draft does not specify the
semantics of attempting to resolve such resources.

This Internet-Draft does not specify how to change an archive
or its content using app URIs.

If the authority component of an app URI matches the alg-val
production, an application MAY attempt to resolve the authority
from any .well-known/ni/ endpoint as specified in
section 4, in order to retrieve the complete
archive. Applications SHOULD verify the checksum of the
retrieved archive before resolving the individual path.

The productions for UUID and alg-val are restricted to
ASCII and should not require any encoding considerations.

Care should be taken to %-encode the directory and file segments
of path-absolute according to (for URIs) or
(for IRIs).

When used as part of an IRI, paths SHOULD be expressed using
international Unicode characters instead of %-encoding as ASCII.

Not all archive media types have an explicit
character encoding specified for their paths.
If no such information is available for the archive format,
implementations MAY assume that the path component
is encoded with UTF-8 .

Some archive media types are case-insensitive, in
which case it is RECOMMENDED to preserve the casing
as expressed in the archive.

As multiple authorities are possible (),
there could be interoperability challenges when exchanging app URIs
between implementations. Some considerations:

* Two implementations describe the same archive
(e.g. stored in the same local file path), but using
different v4 UUIDs. The implementations may
need to detect equality of the two UUIDs out of band.

* Two implementations describe an archive retrieved
from the same URL, with the same v5 UUIDs, but retrieved
at different times. The implementations might disagree
about the content of the archive.

* Two implementations describe an archive retrieved
from the same URL, with the same v5 UUIDs, but retrieved
using different content negotiation resulting in different
archive representations. The implementations may disagree
about path encoding, file name casing or hierarchy.

* Two implementations describe the same archive bytestream
using the alg-val production, but they have used
two different hash algorithms. The implementations may
need to negotiate a common hash algorithm.

* An implementation describes an archive using
the alg-val production, but a second
implementation concurrently modifies the archive’s content.
The first implementation may need to detect changes to
the archive or verify the checksum at the end of its operations.

* Two implementations might have different views of the
content of the same archive if the format permits
multiple entries with the same path. Care should
be taken to follow the convention and specification
of the particular archive format.

* Two implementations access the same archive,
which contains file paths with Unicode characters,
but extract it to two different file systems. Limitations
and conventions for file names in the local file system
(e.g. Unicode normalization, case insensitivity, total path length)
may result in the implementations having
inconsistent or inaccessible paths.

As when handling any content, extra care should be taken when
consuming archives and app URIs from unknown sources.

* An archive could contain compressed files that expand to
fill all available disk space.

* A maliciously crafted archive could contain paths with characters
(e.g. backspace) which could make an app URI invalid or
misleading if used unescaped.

* A maliciously crafted archive could contain paths
(e.g. combined Unicode sequences) that cause the
app URI to be very long, causing issues in information
systems propagating said URI.

* An archive might contain symbolic links that, if
extracted to a local file system, might address files
outside the archive’s directory structure.

* A maliciously crafted app URI might contain ../ segments,
which if naively converted to a file:/// URI might address
files outside the archive’s directory structure.

* In particular for IRIs, an archive might contain multiple
paths with similar-looking characters or with different
Unicode combining sequences, which could be exploited
to mislead users.

* A URI hyperlink might use or guess an app URI authority
to attempt to climb into a different archive for
malicious purposes. Applications SHOULD employ
Same Origin policy checks.

This Internet-Draft contains the Provisional IANA
registration of the app URI scheme according to .

Scheme name: app
Status: provisional
Applications/protocols that use this protocol: Hypermedia-consuming applications that handle archives.
Contact: Stian Soiland-Reyes stain@apache.org
Change controller: Stian Soiland-Reyes

References:

* Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types
* Key words for use in RFCs to Indicate Requirement Levels
* UTF-8, a transformation format of ISO 10646
* URI Resolution Services Necessary for URN Resolution
* Uniform Resource Identifier (URI): Generic Syntax
* Internationalized Resource Identifiers (IRIs)
* Randomness Requirements for Security
* A Universally Unique IDentifier (UUID) URN Namespace
* Defining Well-Known Uniform Resource Identifiers (URIs)
* The Web Origin Concept
* Naming Things with Hashes
* URI Design and Ownership
* Guidelines and Registration Procedures for URI Schemes
* The "file" URI Scheme
* The Base16, Base32, and Base64 Data Encodings
* The BagIt File Packaging Format (V0.97)
* The app: URL Scheme
* Widget URI scheme
* Research Object Bundle 1.0
* Common-Workflow-Language/CWLviewer: CWL Viewer

A document store application has received a file
document.tar.gz whose content will be checked for consistency.

For sandboxing purposes it generates a UUID v4
32a423d6-52ab-47e3-a9cd-54f418a48571 using a pseudo-random generator.
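
The sandboxing step, together with the link resolution this example goes on to describe, might be sketched in Python as follows. The helper names new_base_uri, resolve and same_archive are hypothetical. Because urllib.parse.urljoin() only resolves relative references for schemes it knows about, the sketch temporarily borrows the http scheme for the resolution and then restores app; that workaround is an assumption of this sketch and not something this Internet-Draft specifies.

```python
import uuid
from urllib.parse import urljoin, urlsplit


def new_base_uri() -> str:
    # Fresh UUID v4 authority for one opened archive.
    return f"app://{uuid.uuid4()}/"


def resolve(base: str, reference: str) -> str:
    """Resolve a URI reference against an app URI."""
    if not base.startswith("app://"):
        raise ValueError("base must be an app URI")
    if urlsplit(reference).scheme:
        return reference  # already absolute, possibly another scheme
    # urljoin() returns the reference unchanged for unknown schemes, so
    # resolve under "http" and swap the scheme back afterwards.
    joined = urljoin("http" + base[len("app"):], reference)
    return "app" + joined[len("http"):]


def same_archive(base: str, uri: str) -> bool:
    # Same Origin style check: scheme and authority must both match.
    a, b = urlsplit(base), urlsplit(uri)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


base = "app://32a423d6-52ab-47e3-a9cd-54f418a48571/"
css = resolve(base + "doc.html", "css/base.css")
font = resolve(css, "../fonts/Coolie.woff")
escape = resolve(base + "doc.html", "../../../outside.txt")
for uri in (css, font, escape):
    print(uri, same_archive(base, uri))
```

Excess ../ segments are dropped during resolution, so the last reference stays within the same authority and can only fail as a path that is missing from the archive.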
The app base URI is thus app://32a423d6-52ab-47e3-a9cd-54f418a48571/

The archive contains the files:

* ./doc.html which links to css/base.css
* ./css/base.css which links to ../fonts/Coolie.woff
* ./fonts/Coolie.woff

The application generates the corresponding app URIs and uses those for URI resolution:

* app://32a423d6-52ab-47e3-a9cd-54f418a48571/doc.html links
to app://32a423d6-52ab-47e3-a9cd-54f418a48571/css/base.css
* app://32a423d6-52ab-47e3-a9cd-54f418a48571/css/base.css links
to app://32a423d6-52ab-47e3-a9cd-54f418a48571/fonts/Coolie.woff
* app://32a423d6-52ab-47e3-a9cd-54f418a48571/fonts/Coolie.woff

The application is now confident that all hyperlinked files are
indeed present in the archive. In its database it notes which archive file
corresponds to 32a423d6-52ab-47e3-a9cd-54f418a48571.

If the application had encountered a malicious hyperlink
../../../outside.txt it would first resolve it to
the absolute URI app://32a423d6-52ab-47e3-a9cd-54f418a48571/outside.txt and
conclude from the “Not Found” error that the path /outside.txt was not
present in the archive.

A web crawler is about to index the content of the URL
http://example.com/data.zip and needs to generate absolute URIs
as it continues crawling inside the individual resources of the archive.

The application generates a UUID v5 based on the
URL namespace 6ba7b811-9dad-11d1-80b4-00c04fd430c8 and
the URL to the zip file:

Thus the base app URI is app://b7749d0b-0e47-5fc4-999d-f154abe68065/ for
indexing the ZIP content, after which the crawler finds:

* app://b7749d0b-0e47-5fc4-999d-f154abe68065/
* app://b7749d0b-0e47-5fc4-999d-f154abe68065/pics/
* app://b7749d0b-0e47-5fc4-999d-f154abe68065/pics/flower.jpeg

When the application encounters http://example.com/data.zip some time later
it can recalculate the same base app URI. This time the ZIP file has been modified
upstream and the crawler additionally finds:

* app://b7749d0b-0e47-5fc4-999d-f154abe68065/pics/cloud.jpeg

If files had been removed from the updated ZIP file the
crawler can simply remove those from its database,
as it used the same app base URI as in the last crawl.

An application where users can upload software distributions
for virus checking needs to avoid duplication as users
tend to upload foo-1.2.tar multiple times.

The application calculates the sha-256 checksum of the uploaded
file to be 17edf80f84d478e7c6d2c7a5cfb4442910e8e1778f91ec0f79062d8cbdef42cd
in hexadecimal. The base64url encoding of the
binary version of the checksum is
F-34D4TUeOfG0selz7REKRDo4XePkewPeQYtjL3vQs0.

The corresponding alg-val authority is thus
sha-256;F-34D4TUeOfG0selz7REKRDo4XePkewPeQYtjL3vQs0 meaning the
base app URI is app://sha-256;F-34D4TUeOfG0selz7REKRDo4XePkewPeQYtjL3vQs0/

The application finds that its virus database already contains entries
for:

* app://sha-256;F-34D4TUeOfG0selz7REKRDo4XePkewPeQYtjL3vQs0/bin/evil

and flags the upload as malicious without having to scan it again.

An application is relating BagIt archives
on a shared file system, using structured
folders and manifests rather than individual archive files.

The BagIt payload manifest /gfs/bags/scan15/manifest-md5.txt lists the files:

The application generates a random UUID v4
ff2d5a82-7142-4d3f-b8cc-3e662d6de756 which it adds to
the bag metadata file /gfs/bags/scan15/bag-info.txt

It then generates app URIs for the files listed in the manifest:

A virtual file system driver on a mobile operating system
has mounted several packaged applications for resolving
common resources. An application requests the rendering
framework to resolve a picture from
app://eb1edec9-d2eb-4736-a875-eb97b37c690e/img/logo.png
to show it within a user interface.

The framework first checks that the authority
eb1edec9-d2eb-4736-a875-eb97b37c690e is valid to access
according to the Same Origin policies or permissions of the
running application. It then matches the
authority to the corresponding application package.

The framework then resolves /img/logo.png from within
that package, and returns an image buffer it already had
cached in memory.

This Internet-Draft proposes the URI scheme app, which was originally
proposed by but never registered with IANA.
That W3C Note evolved from which
proposed the URI scheme widget.

Neither W3C Note progressed further as a Recommendation track document.

While the focus of those W3C Notes was to specify how to resolve resources from
within a packaged application, this Internet-Draft generalizes the app URI
scheme to support referencing and identifying resources within any archive, and
de-emphasizes the retrieval mechanism.

For compatibility with existing adaptations of the app URI scheme,
e.g. and , this Internet-Draft reuses the same
scheme name and remains compatible with the intentions of
, but renames “app” to mean
“Archive and Packaging Pointer” instead of “Application”.