FIND Working Group                                          J. Allen
Internet Draft                            Bunyip Information Systems
                                                    19 November 1996
Expires in six months

The Common Indexing Protocol (CIP)

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Abstract

The Common Indexing Protocol (CIP) is used to pass indexing information from server to server in order to facilitate query routing. Query routing is the process of redirecting and replicating queries through a distributed database system towards servers holding the desired results. This document describes the CIP framework, including its architecture and the protocol specifics of exchanging indices.

1. Introduction

1.1. History and Motivation

The Common Indexing Protocol (CIP) is an evolution and refinement of distributed indexing concepts first introduced in the Whois++ Directory Service [RFC1834, RFC1835]. While indexing proved useful in that system to promote query routing, the centroid index object which is passed among Whois++ servers is specifically designed for template-based databases searchable by token-based matching.
With alternative index objects, the index-passing technology will prove useful to many more application domains, not simply Directory Services and those applications which can be cast into the form of template collections.

The indexing part of Whois++ is integrated with the data access protocol. The goal in designing CIP is to extract the indexing portion of Whois++, while abstracting the index objects to apply more broadly to information retrieval. In addition, another kind of technology reuse has been undertaken by converting the ad-hoc data representations used by Whois++ into structures based on the MIME specification for structured Internet mail.

Whois++ used a version number field in centroid objects to facilitate future growth. The initial version was "1". Version 1 of CIP (then embedded in Whois++, and not referred to separately as CIP) had support for only ISO-8859-1 characters, and for only the centroid index object type. Version 2 of the Whois++ centroid was used in the Digger software by Bunyip Information Systems to notify recipients that the centroid carried extra character set information. Digger's centroids can carry UTF-8 encoded 16-bit Unicode characters, or ISO-8859-1 characters, determined by a field in the headers.

This specification is for CIP version 3 (CIPv3). Version 3 is a major overhaul to the protocol, though through a short negotiation sequence, CIP version 3 and earlier servers can interoperate in an index-passing mesh.

1.2 CIP's place in the Information Retrieval world

CIP facilitates query routing. CIP is a protocol used between servers in a network to pass hints which make data access by clients at a later date more efficient. Query routing is the act of redirecting and replicating queries through a distributed database system towards the servers holding the actual results via reference to indexing information.
CIP is a "backend" protocol -- it is implemented in and "spoken" among only network servers. These same servers must also speak some kind of data access protocol to communicate with clients. During query resolution in the native protocol implementation, the server will refer to the indexing information collected by the CIP implementation for guidance on how to route the query.

Data access protocols used with CIP must have some provision for control information in the form of a referral. The syntax and semantics of these referrals are outside the scope of this specification.

2. Architecture

2.1 CIP in the Information Retrieval World

2.1.1 Information Retrieval in the Abstract

In order to better understand how CIP fits into the information retrieval world, we need to first understand the unifying abstract features of existing information retrieval technology. Next, we discuss why adding indexing technology to this model results in a system capable of query routing, and why query routing is useful.

An abstract view of the client/server data retrieval process includes data sets and data access protocols. An individual server is responsible for handling queries over a fixed domain of data. For the purposes of CIP, we call this domain of data the dataset. Clients make searches in the dataset and retrieve parts of it via a data access protocol. There are many data access protocols, each optimized for the data in question. For instance, LDAP and Whois++ are access protocols that reflect the needs of the directory services application domain. Other data access protocols include HTTP and Z39.50.

2.1.2 Indexing Information Facilitates Query Routing

The above description reflects a world without indexing, where no server knows about any other server. In some cases (as with X.500 referrals, and HTTP redirects) a server will, as part of its reply, implicate a peer server in the process of resolving the query.
However, those servers generate replies based solely on their local knowledge. When indexing information is introduced into a server's local database, the server now knows not only answers based on the local dataset, but also answers based on external indices. These indices come from peer servers, via an indexing protocol. CIP is one such indexing protocol.

Replies based on index information may not be the complete answer. After all, an index is not a replicated version of the remote dataset, but a possibly reduced version of it. Thus, in addition to giving complete replies from the local dataset, the server may give referrals to other datasets. These referrals are the core feature necessary for effective query routing.

When CIP is used to pass indices from server to server, the servers make a kind of investment. At the cost of some resources to create, transmit and store the indices, query routing becomes possible.

Query Routing is the process of replicating and moving a query closer to datasets which can satisfy the query. In some distributed systems, widely distributed searches must be accomplished by replicating the query to all sub-datasets. This approach can be wasteful of resources both in the network and on the servers, and is thus sometimes explicitly disabled. Using indexing in such a system opens the door to more efficient distributed searching.

While CIP-equipped servers provide the referrals necessary to make query routing work, it's always the client's responsibility to collate, filter, and chase the referrals it receives. This gives the end-user (or agent, in the case that there's no human user involved in the search) greatest control over the query resolution process. The cost of the added client complexity is weighed against the benefits of total control over query resolution.
In some cases, it may also be possible to decouple the referral chasing from the client by introducing a proxy, allowing existing simple clients to make use of query routing. Such a proxy would transparently resolve referrals into concrete results before returning them to the simple-minded client.

2.1.3 Abstracting the CIP index object

As useful as indices seem, the fact remains that not all queries can benefit from the same type of index. For example, say the index consists of a simple list of keywords. With such an index, it is impossible to answer queries about whether two keywords were near one another, or if a keyword was present in a certain context (for instance, in the title).

Because of the need for application domain specific indices, CIP index objects are abstract; they must be defined by a separate specification. The basic protocols for moving index objects are widely applicable, but the specific design of the index, and the structure of the mesh of servers which pass a particular type of index is dependent on the application domain.

This document describes only the protocols for moving indices among servers. Companion documents describe initial index objects.

2.2 Architectural Details

CIP implements index passing, providing the forward knowledge necessary to generate the referrals used for query routing. The core of the protocol is the index object. In the following sections, the structure of the index objects themselves is presented. Next, how and why indices are passed from server to server is discussed. Finally, the circumstances under which a server may synthesize an index object based on incoming ones are discussed.

2.2.1 The CIP Index Object

A CIP index object is composed of two parts, the attributes and the payload. In the attributes, metadata necessary to process and make use of the index object is transmitted. The actual index resides in the payload.
Three particular attributes warrant specific mention at this point. The "type" of the index object selects one of many distinct CIP index object specifications which define exactly how the index blocks are to be created, parsed and used to facilitate query routing. Another attribute of note is the "DSI", or Dataset Identifier, which uniquely identifies the dataset from which the index was created.

An attribute of the index object which is crucial for generating referrals is the "Base-URI". The URI (or URI's) contained in this attribute form the basis of any referrals generated based on this index block. The URI is also used as input during the index aggregation process to constrain the kinds of aggregation possible, due to multiprotocol constraints.

The exact syntax of these attributes will be specified in the Protocol section, below.

The payload is opaque to CIP itself. It is defined exclusively by the index object specification associated with the object's type attribute. Specifications on how to parse and use the payload are published separately as "CIP index object specifications".

This abstract definition of the index object forms the basis of CIP's applicability to indexing needs across multiple application domains. A precise definition of the content and form of a CIP index block can be found in the Protocol section, below.

2.2.2 Moving Index Objects: How to Build a Mesh

Indices are transmitted among servers participating in a CIP mesh. By distributing this information in anticipation of a query, efficient, accurate query routing is possible at the time a query arrives.

A CIP mesh is a set of CIP servers which pass indices of the same type among themselves. Typically, a mesh is arranged in a hierarchical tree fashion, with servers nearer the root of the tree having larger and more comprehensive indices. However, a CIP mesh is explicitly allowed to have lateral links in it, and there may be more than one part of the mesh that has the properties of a "root".
Mesh administrators are encouraged to avoid loops in the system, but they are not obliged to maintain a strict tree structure. Clients wishing to completely resolve all referrals they receive should protect against referral loops while attempting to traverse the mesh to avoid wasting time and network resources. See the section on "Navigating the Mesh" for a discussion of this.

All indices passed in a given mesh are assumed, as of this writing, to be of the same type (i.e. governed by the same CIP index object specification). It may be possible to create gateways between meshes carrying different index objects, but at this time that process is undefined and declared to be outside the scope of this specification.

[ Note to reviewers: What to do if you get an index object you don't understand? Is it OK to drop it? Is it OK to ignore it for referral generation, but pass it through with the outgoing indices? ]

Experience suggests that this index passing activity should take place among CIP servers as a parallel (and possibly lower-priority) job to their primary job of answering queries. Index objects travel among CIP servers by protocol exchanges explicitly defined in this document, not via the server's native protocol. This distinction is important, and bears repeating:

Queries are answered (and referrals are sent) via the native data access protocol.

Index objects are transferred via alternative means, as defined by this document.

When two servers cooperate to move indexing information, the pair are said to be in a "polling relationship". The server that holds the data of interest and generates the index is called the "polled server". The other server, which is the one that collects the generated index, is the "polling server". In a polling relationship, the polled server is responsible for notifying the polling server when it has a new index that the polling server might be interested in.
In response, the polling server may immediately pick up the index object, or it may schedule a job to pick up a copy of the new index at a more convenient time.

Historical Note: The term "polling" is a bit of a misnomer, since servers in a polling relationship do not actually periodically contact each other just to see what has changed (as the word "poll" implies). The term is left over from earlier CIP versions, and fills a convenient need, so it has been left unchanged.

Independent of the symmetric polling relationship, there's another way that servers can pass indices using CIP. In an "index pushing" relationship, a CIP server simply sends the index to a peer whenever necessary, and allows the receiver to handle the index object as it chooses. The receiving server may refuse it, may accept it, then silently discard it, may accept only portions of it (by accepting it as is, then filtering it), or may accept it without question.

The index pushing relationship is intended for use by dumb leaf nodes which simply want to make their index available to the global mesh of servers, but have no interest in implementing the complete CIP transaction protocol. It lowers the barriers to entry for CIP leaf nodes. For more information on participating in a CIP mesh in this restricted manner, see the section below on "Protocol Conformance".

CIP index passing operations take place across reliable transport mechanisms, including both TCP connections and Internet mail messages. The precise mechanisms are described below in the section named "Protocol".

Security concerns when passing indices are discussed in the "Protocol" section, and also in the "Security Considerations" section. CIP piggybacks off of security mechanisms available in the underlying transport medium.
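
The loop protection recommended earlier in this section for clients that chase every referral can be sketched as follows. This is a minimal illustration in Python, not part of the protocol: referral syntax is outside the scope of this specification, so the sketch assumes that each referral can be reduced to the DSI of the dataset it points at. The `chase_referrals` function, the `lookup` callback, and the OID-style DSI strings are all hypothetical.

```python
from collections import deque


def chase_referrals(start_dsi, lookup):
    """Follow referrals breadth-first, visiting each dataset at most once.

    `lookup(dsi)` is a hypothetical stand-in for one query made over the
    native data access protocol. It returns (hits, referred_dsis).
    """
    visited = set()
    queue = deque([start_dsi])
    hits = []
    while queue:
        dsi = queue.popleft()
        if dsi in visited:
            continue  # already asked this dataset: a loop or a lateral link
        visited.add(dsi)
        local_hits, referrals = lookup(dsi)
        hits.extend(local_hits)
        queue.extend(r for r in referrals if r not in visited)
    return hits, visited


# A toy three-server mesh containing a referral loop. The DSIs are made up.
MESH = {
    "1.3.6.1.4.1.99999.1": (["alice"], ["1.3.6.1.4.1.99999.2", "1.3.6.1.4.1.99999.3"]),
    "1.3.6.1.4.1.99999.2": (["bob"], ["1.3.6.1.4.1.99999.1"]),
    "1.3.6.1.4.1.99999.3": ([], ["1.3.6.1.4.1.99999.2"]),
}

hits, visited = chase_referrals("1.3.6.1.4.1.99999.1", lambda dsi: MESH[dsi])
```

Keying the visited set on the DSI, rather than on a hostname or URI, is one plausible design choice: a dataset that is reachable through several lateral links is then still queried only once.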

2.2.3 Index Object Synthesis

From the preceding discussion, it should be clear that indexing servers read and write index objects as they pass them around the mesh. However, a CIP server need not simply pass the inbound indices through as the outbound ones. While it's always permissible to pass an index object through to other servers, a server may choose to aggregate two or more of them, thereby reducing redundancy in the index, at the cost of longer referral chains.

A basic premise of index passing is that even while collapsing a body of data into an index by lossy compression methods, hints useful to routing queries will survive in the resulting index. Since the index is not a complete copy of the original dataset, it contains less information. Index objects can be passed along unchanged, but as more and more information collects in the resulting index object, redundancy will creep in again, and it may prove useful to apply the compression again, by aggregating two or more index objects into one.

This kind of aggregation should be performed without compromising the ability to correctly route queries while avoiding excessive numbers of missed results. The acceptable likelihood of false negatives must be established on a per-application-domain basis, and is controlled by the granularity of the index and the aggregation rules defined for it by the particular specification.

However, when CIP is used in a multi-protocol application domain, such as Directory Service (with contenders including Whois++, LDAP, and Ph), things get significantly trickier. The fundamental problem is to avoid forcing a referral chain to pass through part of the mesh which does not support the protocol by which that client made the query. If this ever happens, the client loses access to any hits beyond that point in the referral chain, since it cannot resolve the referral in its native data access protocol.
This is a failure of query routing, which should be avoided.

In addition to multi-protocol considerations, server managers may choose not to allow index object aggregation for performance reasons. As referral chains lengthen, a client needs to perform more transactions to resolve a query. As the number of transactions increases, so do the user-perceived delays, the system loads, and the global bandwidth demands.

In general, there's a tradeoff between aggressive aggregation (which leads to reductions in the indexing overhead) and aggressive referral chain optimization. This tradeoff, which is also sensitive to the particular application domain, needs to be explored more in actual operational situations.

Conceptually, a CIP index server has several index objects on hand at any given time. If it holds data in addition to indexing information, the server has an index object formed from its own data, called the "local index". It may have one or more indices from remote servers which it has collected via the index passing mechanisms. These are called "inbound indices".

Implementor's Note: It may not be necessary to keep all of these structures intact and distinct in the local database. It is also not required to keep the outbound index (or indices) built and ready to distribute at all times. The previous paragraph merely introduces a useful model for expressing the aggregation rules. Implementors are free to model index objects internally however they see fit.

The following two rules control how a CIP server formulates its outgoing indices:

1. An index server may pass any of the index objects in its local index and its inbound indices through unchanged to polling servers.

2. If and only if the following three conditions are true, an index server can aggregate two or more index objects into a single new index object, to be added to the set of outbound indices.

   a. Each index object to be aggregated covers exactly the same set of protocols, as defined by the scheme component of the Base-URI's in each index object.

   b. The index server supports every one of the data access protocols represented by the Base-URI's in the index objects to be aggregated.

   c. The specification for the index object type specified by the type attribute of the index objects explicitly defines the aggregation operation.

   The resulting index object must have Base-URI's characteristic of the local server for each protocol it supports. The outgoing objects should have the DSI of the local server.

3. Protocol

In this section, the actual protocol for transmitting CIP index objects and maintaining the mesh is presented. While the preceding section (Architecture) describes the concepts involved, this section is the authoritative definition of the message formats and transfer mechanisms of CIP.

3.1 Philosophy

The philosophy of the CIP protocol design is one of building-block design. Instead of relying on bulky protocol definition tools, or ad-hoc text encodings, CIP draws on existing, well understood Internet technologies like MIME, RFC-822, Whois++, FTP, and SMTP. Hopefully this will serve to ease implementation and consensus building. It should also stand as an example of a simple way to leverage existing Internet technologies to easily implement new application-level services.

3.2 MIME message exchange mechanisms

CIP relies on interchange of standard MIME messages for all requests and replies. These messages are passed over a bidirectional, reliable transport system. Currently, only transport over reliable network streams (via TCP) or via the Internet mail infrastructure is supported.

The CIP server which initiates the connection (conventionally referred to as a client) will be referred to below as the sender-CIP.
The CIP server which accepts a sender-CIP's incoming connection and responds to the sender-CIP's requests is called a receiver-CIP.

3.2.1 The Stream Transport

CIP messages are transmitted over bidirectional TCP connections via a simple text protocol. The transaction can take place over any TCP port, as specified by the mesh configuration. There is no "well known port" for CIP transactions. All configuration information in the system must include both a hostname and a port.

All sender-CIP actions (including requests, connection initiation, and connection finalization) are acknowledged by the receiver-CIP with a response code. See Appendix B for the format of these codes, a list of the responses a CIP server may generate, and the expected sender-CIP action for each.

In order to maintain backwards compatibility with existing Whois++ servers, CIPv3 sender-CIPs must first verify that the newer protocol is supported. They do this by sending the following illegal Whois++ system command: "# CIP-Version: 3". On existing Whois++ servers implementing version 1 and 2 of CIP, this results in a 500-series response code, and the server terminates the connection. If the server implements CIPv3, it must instead respond with response code 300. Future versions of CIP can be correctly negotiated using this technique with a different string (e.g. "CIP-Version: 4"). An example of this short interchange is given below.

Note: If a sender-CIP can safely assume that the server implements CIPv3, it may choose to send the "# CIP-Version: 3" string and immediately follow it with the CIPv3 request. This optimization, useful only in known homogeneous CIPv3 meshes, avoids waiting for the round-trip inherent in the negotiation.

Once a sender-CIP has successfully verified that the server supports CIPv3 requests, it can send the request, formatted as a MIME message, using the network standard line ending: "<CR><LF>".
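
The version negotiation described above can be sketched from the sender-CIP's side. The Python fragment below is illustrative only: the "% <code> <text>" reply shape is inferred from the examples in this section (Appendix B remains the authority on response codes), and the names `VERSION_PROBE`, `parse_code`, and `supports_cipv3` are hypothetical.

```python
import re

# The illegal Whois++ system command used to probe for CIPv3 support.
VERSION_PROBE = b"# CIP-Version: 3\r\n"

# Assumed reply shape, taken from the examples: "% 300 CIPv3 OK!"
_REPLY = re.compile(r"^%\s*(\d{3})\b")


def parse_code(line: str) -> int:
    """Extract the three-digit response code from one reply line."""
    match = _REPLY.match(line.strip())
    if match is None:
        raise ValueError(f"not a response-code line: {line!r}")
    return int(match.group(1))


def supports_cipv3(reply_line: str) -> bool:
    """Interpret the receiver's answer to VERSION_PROBE.

    300 means CIPv3 is spoken. A 500-series code means an older Whois++
    server, which will also close the connection.
    """
    code = parse_code(reply_line)
    if code == 300:
        return True
    if 500 <= code <= 599:
        return False
    raise ValueError(f"unexpected code {code} during version negotiation")
```

A sender-CIP that receives a 500-series reply might remember the result for that host and port, so that later connections skip the probe and fall back to the older protocol directly.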
The message is terminated using SMTP-style message termination. The data is sent octet-for-octet, except when the pattern "<CR><LF>." is seen, in which case the period is repeated, resulting in the following pattern: "<CR><LF>..". When the data is finished, the octet pattern "<CR><LF>.<CR><LF>" is transmitted to the receiver-CIP. On the receiver-CIP's side, the reverse transformation is applied, and the message read consists of all bytes up to, but not including, the terminating pattern.

In response to the request, the receiver-CIP sends a response code, from either the 200, 400, or 500 series. The receiver-CIP then processes the request and replies, if necessary, with a MIME message. This reply is also delimited by an SMTP-style message terminator.

After responding with a response code, the receiver-CIP must prepare to read another request message, resetting state to the point when the sender-CIP has just verified the CIP version. If the sender-CIP is finished making requests, it may close the connection. In response the receiver-CIP must abort reading the message and prepare for a new sender-CIP connection (resetting its state completely).

An example is given below. In this (and all further examples) octets sent by the sender-CIP are preceded by ">>>" and those sent by the receiver-CIP by "<<<". Line endings are explicitly shown in angle brackets; newlines in this text are added only for readability. Comments occur in curly brackets.

   >>> { sender-CIP connects to receiver-CIP }
   <<< % 220 Example CIP server ready<CR><LF>
   >>> # CIP-Version: 3<CR><LF>
   <<< % 300 CIPv3 OK!<CR><LF>
   >>> Mime-Version: 1.0<CR><LF>
   >>> Content-type: application/cip-request; request="noop"<CR><LF>
   >>> { This example uses the "noop" request. Receiver-CIPs must
         simply ignore this request. The actual text in the following
         request is: "This next line is only a dot..". }
   >>> The next line is only a dot:<CR><LF>
   >>> ..<CR><LF>
   >>> <CR><LF>
   >>> .<CR><LF>
   <<< % 200 Good MIME message received<CR><LF>
   >>> { sender-CIP shuts down socket for writing }
   <<< % 222 Connection closing in response to sender-CIP shutdown<CR><LF>
   <<< { receiver-CIP closes its side, resets, and awaits a new
   <<<   sender-CIP }

An example of an unsuccessful version negotiation looks like this:

   >>> { sender-CIP connects to receiver-CIP }
   <<< % 220 Whois++ server ready<CR><LF>
   >>> # CIP-Version: 3<CR><LF>
   <<< % 500 Syntax error<CR><LF>
   <<< { server closes connection }
   >>> { sender-CIP may attempt to retry using version 1 or 2
         protocol instead. Sender-CIP may cache results of this
         unsuccessful negotiation to avoid later attempts. }

3.2.2 Internet mail infrastructure as transport

As an alternative to TCP streams, CIP transactions can take place over the existing Internet mail infrastructure. There are two motivations for this feature of CIP. First, it lowers the barriers to entry for leaf servers. When the need for a full TCP implementation is relaxed, leaf nodes (which, by definition, only send index objects) can consist of as little as a database and an indexing program (possibly written in a very high level language) to participate in the mesh. Second, it keeps with the philosophy of making use of existing Internet technology. The MIME messages used for requests and responses are, by definition of the MIME specification, suitable for transport via the Internet mail infrastructure.

With a few simple rules, we open up an entirely different way to interact with CIP servers which choose to implement this transport. See Protocol Conformance, below, for details on what options server implementors have about supporting the various transports.

The basic rhythm of request/response is maintained when using the mail transport. The following sections clarify some special cases which need to be considered for mail transport of CIP objects.
In general, all mail protocols and mail format specifications (especially MIME Security Multiparts) can be used with the CIP mail transport.

[ Note to reviewers: What about version negotiation for mail transport? Should we add a CIP-Version header? ]

3.2.2.1 Return path

When CIP transactions take place over a bidirectional stream, the return path for errors and results is implicit. Using mail as a transport introduces difficulties to the recipient, because it's not always clear from the headers exactly where the reply should go, though in practice there are some heuristics used by MUA's.

CIP solves this problem by fiat. CIP requests sent using the mail transport must include a Reply-To header as specified by RFC-822. Any mail received for processing by a CIP server implementing the mail transport without a Reply-To header must be ignored, and a message should be logged for the local administrator. The receiver must not attempt to reply with an error to any address derived from the incoming mail.

If under no circumstances is a response to be sent to a CIP request, the sender should include a Reply-To header with the address "<>" in it. Receivers must never attempt to send replies to that address, as it is defined to be invalid (both here, and by the BNF grammar in RFC-822). It should be noted that, in general, it is a bad idea to turn off error reporting in this way. However, in the simplest case of an index pushing program, this may be a desirable simplification.

3.2.2.2 Response format

As with the stream transport, all requests must be followed by a response code, except in the cases noted in section 3.2.2.1, when the return path is unavailable.

The response takes the form of a MIME multipart/mixed message with one or two parts. The first part must be of type "application/cip-response".
The definition for this MIME type is:

   MIME type name:           application
   MIME subtype name:        cip-response-code
   Required parameters:      code
   Optional parameters:      none
   Security considerations:  none

The "code" parameter carries the response code, in the same format as that used over the stream connection, without the trailing "<CR><LF>". See Appendix B for more information on response codes.

If the request results in a response MIME object, this object is the second object in the multipart/mixed MIME object.

3.2.2.3 Large objects

The Internet mail infrastructure has built-in limits on message size, to make reliable delivery efficient and implementable. It's conceivable that CIP responses will outgrow these limits, based on the content and nature of index objects. Thus, the use of the MIME message/partial message segmentation protocol is suggested for large objects. Such treatment is not required by this specification, but it is suggested for CIP objects which, in their ready-to-mail form, will exceed 200 kilobytes in size. Senders which do not adhere to this recommendation risk having their messages truncated, returned as undeliverable, or otherwise mis-handled.

If a CIP server implementing the mail transport receives a message/partial, but does not have the proper routines or resources to reassemble the entire message, it must return a response code of "500", as specified in section 3.2.2.2, unless prohibited by the cases enumerated in section 3.2.2.1.

3.3 CIP Transactions

Messages passed by CIP implementations over the reliable transport mechanism fall into two categories, requests and responses. Not all requests result in a CIP response. All requests are at least acknowledged by an appropriate response code. Both requests and responses are formatted as MIME messages. The specific MIME types involved are defined below.
As with all MIME objects, CIP messages may be wrapped in a security multipart package to provide authentication and privacy. The security policy with respect to all messages is implementation defined, when not explicitly discussed below. CIP implementors are strongly urged to allow server administrators maximum configurability to secure their servers against malicious anonymous CIP messages. In general, operations which can permanently change the server's state in a harmful way should only take place upon receipt of a properly signed message from a trusted CIP peer or administrator. Implementors should provide appropriate auditing capabilities so that both successful and failed requests can be tracked by the server administrator.

3.3.1 CIP Requests

A CIP request either initiates an index transfer, interrogates the state of the receiver-CIP (or the server's participation in the mesh), or changes the state of the server (or the server's place in the mesh).

CIP requests are sent as a MIME message of type "application/cip-request". The definition for this MIME type follows:

   MIME type name:           application
   MIME subtype name:        cip-request
   Required parameters:      request
   Optional parameters:      type, dsi
   Security considerations:  (See Section 6.2)

In the following sections, the server's response for each possible value for "request" is defined. Note that the parameters listed above as optional are only optional with respect to generic MIME parsing. If one or more of the parameters needed to fulfill a request is missing, a response code of 502 is returned. Extra optional parameters which are unrecognized must be silently ignored.
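
The parameter check described above can be sketched with Python's standard `email` package. This is a minimal illustration under stated assumptions, not a reference implementation: the `REQUIRED_PARAMS` table mirrors the per-request requirements given in the subsections that follow, the function name `missing_parameters` is hypothetical, and the index type "centroid" and the DSI in the usage lines are made-up values.

```python
from email import message_from_string
from email.utils import collapse_rfc2231_value

# Required parameters per request name, as listed in sections 3.3.1.1
# and 3.3.1.2. Requests not yet defined by the draft are not covered.
REQUIRED_PARAMS = {
    "noop": (),
    "poll": ("type", "dsi"),
}


def missing_parameters(raw_message: str) -> list:
    """Return the names of required Content-type parameters that are absent.

    A non-empty result corresponds to the case the text answers with
    response code 502. Unrecognized extra parameters are ignored.
    """
    msg = message_from_string(raw_message)
    request = msg.get_param("request")
    if request is None:
        return ["request"]
    request = collapse_rfc2231_value(request).lower()
    needed = REQUIRED_PARAMS.get(request, ())
    return [name for name in needed if msg.get_param(name) is None]


incomplete_poll = (
    "Mime-Version: 1.0\r\n"
    'Content-type: application/cip-request; request="poll"; type="centroid"\r\n'
    "\r\n"
)
complete_poll = (
    "Mime-Version: 1.0\r\n"
    'Content-type: application/cip-request; request="poll";'
    ' type="centroid"; dsi="1.3.6.1.4.1.99999.1"; x-extra="ignored"\r\n'
    "\r\n"
)
```

Here `missing_parameters(incomplete_poll)` reports the absent "dsi" parameter, while `missing_parameters(complete_poll)` reports nothing even though an unrecognized "x-extra" parameter is present.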
3.3.1.1 No-operation

   Request Name:        noop
   Required parameters: (none)

A CIP request with the "request" parameter set to "noop" must be acknowledged with response code 200 (request OK, no response forthcoming).

This request must not require a signed MIME object. Implementations should accept requests which have been validly signed.

3.3.1.2 Poll

   Request Name:        poll
   Required parameters: type, dsi

The "poll" request is used by a poller to request the transfer of an index object. It requires the following parameters:

   type: The index object type requested
   dsi:  The dataset which the index should cover

If there are no index objects available for the given DSI, or the receiver-CIP does not support the given index object type, the receiver-CIP must respond with response code 200 (successful, no response forthcoming). Otherwise, the response code must be 201 (successful, response is forthcoming).

The security policy with respect to polling requests is wholly implementation defined. Implementations may be configured to accept or reject anonymous poll requests.

3.3.1.[3456...] More requests to go here...

[ Note to reviewers: need to add some mesh-management and statistics gathering requests here, still. ]

3.3.2 CIP responses

A CIP response is sent by a receiver-CIP in response to certain requests. The response must be preceded by a response code of 201, which must be interpreted by sender-CIPs to mean, "request was successful, response object follows".

3.3.2.1 Index Object set

In reply to the "poll" request, a server may choose to send one or more index objects. Regardless of the number of index objects returned, the response must take the form of a MIME multipart/mixed message. Each part must itself be a MIME object of type "application/cip-index-object".
The definition for this type follows:

   MIME type name:          application
   MIME subtype name:       cip-index-object
   Required parameters:     type, dsi, base-uri
   Optional parameters:     none
   Security considerations: (See Section 6)

As previously described, an index object consists of several parameters, followed by an opaque payload, which only has meaning within the context of a particular index object type specification. This opaque payload is carried in the body of the "application/cip-index-object" MIME object.

The required parameters are to be used as follows:

   type:     Specifies what index object specification should be used when attempting to parse, use, and aggregate this object.

   dsi:      The DSI is a string which globally uniquely identifies the dataset from which the index was created.

   base-uri: One or more URIs which will form the base of any referrals created based upon this index object.

3.4 Protocol Element Definition

In this section, specific details of the syntax and semantics of specific parts of CIP transactions are defined. Because all of these elements are transmitted as MIME parameters in the Content-type header, the usual conventions for header encoding (RFC-1522) must be followed. In particular, each of these elements can easily extend past the 75 character limit imposed by RFC-1522 on encoded-words, and should be treated accordingly.

3.4.1 The Dataset Identifier (DSI)

A dataset identifier is an identifier chosen from any part of the ISO/CCITT OID space. The DSI uniquely identifies a given dataset among all datasets indexed by CIP. This uniqueness requirement is crucial to allow clients to avoid referral loops during the query resolution process.

As currently defined, OIDs are an unbounded sequence of unbounded integers. While this creates an infinite numbering space, it presents problems for implementors dealing with machines with finite resources.
To ease implementation, this document specifies an ASCII encoding of the OID, and specifies limits which make implementation easier.

For the purposes of interchange in CIP messages, an OID must conform to the following rules:

   dsi         = integer *( "." integer )
   integer     = all-digits / ( one-to-nine *all-digits )
   one-to-nine = "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
   all-digits  = "0" / one-to-nine

Under no circumstances shall the total length of the resulting string exceed 255 characters. OIDs which cannot, due to their length, conform to these rules must not be used as CIP dataset identifiers.

An implementation must not attempt to parse the individual integers unless it is prepared to handle arbitrary-length integers. Treating the DSI as anything other than an opaque string of US-ASCII characters is not recommended.

Two CIP DSIs are considered to match if both conform to the above rules and every octet matches.

3.4.2 DSI Descriptions

A DSI description can be used as an optional parameter anywhere a DSI may appear in a CIP transaction. The DSI description is a short, human-readable description of the meaning of the particular DSI. Under no circumstances should implementors expect to receive a DSI description along with a DSI.

A DSI description must conform to the following rules:

   dsi-desc = 255*octet
   octet    = { any 8-bit value }

Note that no character set or character set encoding is explicitly specified for this parameter. Implementors who choose to use any character set other than US-ASCII (the only acceptable character set in unencoded RFC-822 headers) must make use of the encodings specified in RFC-1522. Note also that certain octets, interpreted as US-ASCII characters, are illegal in their unencoded form in headers.

To avoid complex encoding issues, implementors may choose to restrict local DSI descriptions to only those US-ASCII characters legal in RFC-822 headers.
However, they must still correctly receive and transmit properly encoded DSI descriptions holding arbitrary octets.

3.4.3 Base URIs

CIP index objects carry base-URIs to facilitate referral generation based on the index object. The base-URI parameter carries a whitespace-delimited list of URLs. URLs are defined in RFC-1738. The exact rules are as follows:

   base-uri   = genericurl *( 1*whitespace genericurl )
   whitespace = SP (decimal 32) / HT (decimal 9) / CR (decimal 13) / LF (decimal 10)
   genericurl = { as specified in RFC-1738, section 5 }

3.5 Conformance

There are two levels of conformance to this specification. An implementation which correctly implements all required parts of the protocol is called a "fully conformant Common Indexing Protocol server". An implementation that supports only the index-pushing relationship is called a "simple CIP leaf server".

A fully conformant implementation need only support one of the previously specified CIP transports, though it's permissible and encouraged to create implementations which accept both kinds of input (from network streams, and via the Internet mail infrastructure).

A server which is only a simple CIP leaf server need only support one transport, and need only generate the CIP index object response (see section 3.3.2.1). When using the mail transport, a simple CIP leaf server may choose to disable error reporting by specifying the null Reply-To address as discussed in section 3.2.2.1.

4. Extensibility

This protocol is designed to be extensible to accommodate changes in the indexing needs within the Internet. This document specifies version 3 of CIP. Through the simple version negotiation discussed in section 3.2.1, future versions of CIP can be made to interoperate.

In addition to this form of future growth, CIP index objects are defined in an abstract way to allow for multiple interoperating indexing formats and algorithms. An index object is defined in the abstract by this document.
Each specific index object specification is defined by an external Internet Draft or RFC. In the following sections, the minimum requirements for such a document are discussed, and an example is given.

4.1 Required Coverage

A CIP index object specification must, at minimum, define the following parameters, formats, and processes with respect to the new index object in question.

4.1.1 Object Type

The index object must be given a unique type among all other reserved types. Reserved types are those previously documented by other CIP index object specifications, according to standard IETF processes.

An object type name consists of 1 to 20 characters from the set [a-zA-Z0-9-]; that is, all upper and lower case letters, all digits, and the ASCII minus character (decimal 45). Though type names may be specified case sensitively, they must be compared and otherwise processed case insensitively.

A new name must be assigned when any changes to the document describing the index object type are not completely backwards compatible. Designers are advised to pick an initial name ending in "-1", so that future versions may be easily differentiated by simply incrementing the suffix number.

4.1.2 Payload Format

The format of the payload must be defined in sufficient detail to allow implementors of peer servers to successfully implement a parser for the index format. CIP payloads are considered opaque for transport, but they must nonetheless conform to the specification for a MIME body, since they are packed into MIME containers for transport.

Because CIP itself uses MIME to manage the data in CIP transactions, designers should seriously consider using MIME internally to the payload too.
In this case, the object specification document must clearly state that the payload should be treated as a stream of data with MIME type "message/rfc822", and thus should be fed back into the MIME parser once it has been extracted. In addition, the document should describe the structure and type of the payload's MIME objects.

4.1.3 Matching Semantics

The document must describe the semantics of the entities in the payload with respect to matching. This allows a CIP server which is attempting to make use of the indexing material in an index object to decide what referrals (if any) to generate in response to a query.

[ Note to reviewers: This section is really hard to nail down... it needs to be vague because of the abstract notion of "indexing data" and "match".]

4.1.4 Aggregation Semantics

The document must either specifically disallow aggregation, or it must clearly describe the process of aggregating two indices of the specified type into a single index object.

4.1.5 Security Considerations

As is customary with Internet protocol documentation, a brief review of security implications of the proposed object must be included. This section may need to do little more than echo the considerations expressed in this document's Security Considerations section.

4.2 Optional Coverage

Because indexing algorithms, stop-lists, and data reduction technologies are considered by some index object designers to be proprietary, it is not necessary to discuss the process used to derive indexing information from a body of source material. When proprietary indexing technologies are used in a public mesh, all CIP servers in the mesh should be able to parse the index object (and perform aggregation operations, if necessary), though not all of them need to be able to create these proprietary indices from source data.
Thus, index object designers may choose to remain silent on the algorithms used for the generation of indices, as long as they adequately document how to participate in a mesh of servers passing these proprietary indices.

Designers should also seriously consider including useful examples of source data, the generated index, and the expected results from example matches. When the aggregation algorithm is complex, including a table showing two indices and the resultant aggregate index is recommended.

4.2 Example: The Token List

The following sections outline an example index object specification. The "Token-List-1" object is not intended for production use. It is merely documented here as an example of the kind of specification needed to introduce a new index object type.

4.2.1 Introduction

This document describes the Token List CIP index object type. It is intended to allow simple token searches on unstructured bodies of text.

4.2.2 Name

The index object described below will have the CIP index object type of "Token-List-1".

4.2.3 Payload format

A Token List takes the form of a MIME message of type "text/plain". If the MIME charset parameter is present, its value must be "us-ascii". A Content-Transfer-Encoding header may be present to specify one of the valid MIME encodings for the body.

The tokens themselves are listed, one at a time, on separate lines of the body of the MIME object. No token may exceed 75 characters in length. Tokens are transferred in all lowercase letters, though robust implementations should verify and correct mixed-case tokens on input. No token will appear in the index more than once.
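The payload rules above can be sketched in Python as shown below. This is only an illustration under stated assumptions: the function names are invented for the example, the body is assumed to have already been decoded from its Content-Transfer-Encoding, blank lines are ignored, duplicate tokens are dropped rather than rejected, and lines are written with CR LF terminators.

```python
MAX_TOKEN_LENGTH = 75  # limit stated in the payload format above


def parse_token_list(body):
    """Parse a decoded Token-List-1 body into unique lowercase tokens."""
    tokens = []
    seen = set()
    for line in body.splitlines():
        token = line.strip().lower()  # correct mixed-case tokens on input
        if not token:
            continue  # assumption: blank lines carry no token
        if len(token) > MAX_TOKEN_LENGTH:
            raise ValueError("token exceeds 75 characters: %r..." % token[:20])
        if token not in seen:  # assumption: duplicates are dropped, not rejected
            seen.add(token)
            tokens.append(token)
    return tokens


def format_token_list(tokens):
    """Serialize tokens, one lowercase token per line (CR LF is assumed)."""
    cleaned = parse_token_list("\n".join(tokens))
    return "".join(token + "\r\n" for token in cleaned)
```

Running the serializer through the same normalization as the parser keeps the two directions consistent: whatever a server emits, a peer using the same rules reads back unchanged.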
4.2.4 Tokenization Algorithm

A contiguous piece of text is broken into tokens using the following algorithm:

   while (more text) {
      skip all characters that are not token-characters
      mark token start
      skip all characters that are token-characters
      mark token end
      if (start != end)
         emit the token bounded by (start, end)
   }

The check before the emit step keeps the algorithm from producing an empty token when the text ends with characters that are not token-characters.

The class "token-characters" is defined as all ASCII characters between 65 and 90 (A-Z), between 97 and 122 (a-z), and between 48 and 57 (0-9). All of the preceding ranges are inclusive, and all are in decimal.

4.2.5 Matching Semantics

A token should be considered to match a query if a case insensitive, character-by-character comparison is positive. For the purposes of this specification, case insensitive is defined by the following two rules. For decimal ASCII codes 65 through 90 (inclusive), the character is considered to match itself, and also to match itself plus 32. For all other ASCII codes, the character is considered to match itself only.

4.2.6 Aggregation Semantics

Two or more token-list index objects can be aggregated by merging the lists of tokens and removing duplicates. All matches performed to remove the duplicates should be performed according to section 4.2.5.

5. Navigating the mesh

With the CIP infrastructure in place to manage index objects, the only problem remaining is how to successfully use the indexing information to do efficient searches. CIP facilitates query routing, which is essentially a client activity. A client connects to one server, which redirects the query to servers "closer to" the answer. This redirection message is called a referral.

5.1 The Referral

The concept of a referral and the mechanism for deciding when they should be issued is described by CIP. However, the referral itself must be transferred to the client in the native protocol, so its syntax is not directly a CIP issue.
The mechanism for deciding that a referral needs to be made resides in the CIP implementation in the server. The mechanism for generating and sending that referral to the client resides in the server's native protocol implementation.

A referral is made when a search against the index objects held by the server shows that there may be hits available in one of the datasets represented by those index objects. There may be no more than one referral per dataset. If there is more than one index object (each of a different type) for the same dataset, only one of them will generate a referral.

Though the format of the referral is dependent on the native protocol of the CIP server, the baseline contents of the referral are constant across all protocols. At the least, a DSI and a URI must be returned. The DSI is the DSI associated with the dataset which caused the hit. This must be presented to the client so that it can avoid referral loops. The Base-URI parameter which travels along with index objects is used to provide the other required part of a referral.

The additional information in the Base-URI may be necessary for the server receiving the referred query to correctly handle it. A good example of this is an LDAP server, which needs a base X.500 distinguished name from which to search. When an LDAP server sends a centroid-format index object up to a CIP indexing server, it sends a Base-URI along with the name of the X.500 subtree for which the index was made. When a referral is made, the Base-URI is passed back to the client so that it can pass it to the original LDAP server.

[ Note to reviewers: of course that's all speculative ]

As usual, in addition to sending the DSI, a DSI-Description attribute can be optionally sent.
Because a client may attempt to check with the user before chasing the referral, and because this string is the friendliest representation of the DSI that CIP has to offer, it should be included in referrals when available (i.e. when it was sent along with the index object).

5.2 Cross-protocol Mappings

Each data access protocol which uses CIP will need a clearly defined set of rules to map queries in the native protocol to searches against an index object. These rules will vary according to the data domain.

In principle, this could create a bit of a scaling difficulty; for N protocols and M data domains, there would be N x M mappings required. In practice, this should not be the case, since some access protocols will be wholly unsuited to some data domains. Consider, for example, an LDAP server trying to make a search in an index object composed from Web pages. What would the results be? How would the client make sense of the results?

However, as pre-existing protocols are connected to CIP, and as new ones are developed to work with CIP, this issue must be examined. In the case of Whois++ and the CENTROID index type, there is an extremely close mapping, since the two were designed together. When hooking LDAP to the CENTROID index type, it will be necessary to map the attribute names used in the LDAP system to attribute names which are already being used in the CENTROID mesh. It will also be necessary to tokenize the LDAP queries under the same rules as the CENTROID indexing policy, so that searches will take place correctly.

These application- and protocol-specific actions must be specified in the index object specification, as discussed in section 4, "Extensibility".

5.3 Moving through the mesh

From a client's point of view, CIP simply pushes all the "hard work" onto its shoulders. After all, it's the client which needs to track down the real data. While this is true, it's very misleading.
Because the client has control over the query routing process, the client has total control over the size of the result set, the speed with which the query progresses, and the depth of the search.

The simplest client implementation simply provides referrals to the user in a raw, ready-to-reuse form, without attempting to follow them. For instance, one Whois++ client, which interacts with the user via a Web-based form, simply makes referrals into HTML hypertext links. Encoded in the link via the HTML forms interface GET encoding rules is the data of the referral: the hostname, port, and query. If a user chooses to follow the referral link, they execute a new search on the new host.

A more savvy client might present the referrals to the user and ask which should be followed. And, assuming appropriate limits were placed on search time and bandwidth usage, it might be reasonable to program a client to follow all referrals automatically.

When following all referrals, a client must show a bit of intelligence. Remember that the mesh is defined as an interconnected graph of CIP servers. This graph may have cycles, which could cause an infinite loop of referrals, wasting the servers' time and the client's too.

When faced with the job of tracking down all referrals, a client must use some form of a mesh traversal algorithm. Such an algorithm has been documented for use with Whois++ in RFC-1914. The same algorithm can be easily used with this version of CIP. In Whois++ the equivalent of a DSI is called a handle. With this substitution, the Whois++ mesh traversal algorithm works unchanged with CIP.

Finally, the mesh entry point (i.e. the first server queried) can have an impact on the success of the query. To avoid scaling issues, it is not acceptable to use a single "root" node, and force all clients to connect to it.
Instead, clients should connect to a reasonably well connected (with respect to the CIP mesh, not the Internet infrastructure) local server. If no match can be made from this entry point, the client can expand the search by asking the original server who polls it. In general, those servers will have a better "vantage point" on the mesh, and will turn up answers that the initial search didn't. The mechanism for dynamically determining the mesh structure like this exists, but it is not documented here for brevity. See RFC-1913 for more information on the POLLED-BY and POLLED-FOR commands.

[ Note to reviewers: This is a problem; defining mesh structure queries intermingles CIP and native protocol issues. ]

6. Security Considerations

In this section, we discuss the security considerations necessary when making use of this specification. There are at least two levels at which security considerations come into play. Indexing information can leak undesirable amounts of proprietary information, unless carefully controlled. At a more fundamental level, the CIP protocol itself requires external security services to operate in a safe manner. Both topics are covered below.

6.1 Secure Indexing

CIP is designed to index all kinds of data. Some of this data might be considered valuable, proprietary, or even highly sensitive by the data maintainer. Take, for example, a human resources database. Certain public bits of data, in moderation, can be very helpful for a company to make public. However, the database in its entirety is a very valuable asset, which the company must protect. Much experience has been gained in the directory service community over the years as to how best to walk this fine line between completely revealing the database and making useful pieces of it available.

Another example where security becomes a problem is for a data publisher who'd like to participate in a CIP mesh.
The data that publisher creates and manages is the prime asset of the company. There is a financial incentive to participate in a CIP mesh, since exporting indices of the data will make it more likely that people will search your database. (Making profit off of the search activity is left as an exercise to the entrepreneur.) Once again, the index must be designed carefully to protect the database while providing a useful synopsis of the data.

One of the basic premises of CIP is that data providers will be willing to provide indices of their data to peer indexing servers. Unless they are carefully constructed, these indices could constitute a threat to the security of the database. Thus, security of the data must be a prime consideration when developing a new index object type. The risk of reverse engineering a database based only on the index exported from it must be kept to a level consistent with the value of the data and the need for fine-grained indexing.

6.2 Protocol Security

CIP protocol exchanges, taking the form of MIME messages, can be secured using any technology available for securing MIME objects. In particular, use of RFC-1847's Security Multiparts is recommended. A solid application of RFC-1847 using widely available encryption software is PGP/MIME, RFC-2015. Implementors are encouraged to support PGP/MIME, as it is the first viable application of the MIME Security Multiparts architecture. As other technologies become available, they may be incorporated into the CIP mesh.

If an incoming request does not have a valid signature, it must be considered anonymous for the purposes of access control. Servers may choose to allow certain requests from anonymous peers, especially when the request cannot cause permanent damage to the local server. In particular, answering anonymous poll requests encourages index builders to poll a server, making the server's resources better known.
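One possible shape for such an access-control decision is sketched below in Python. Everything about the policy is an assumption made for illustration: the signature state labels, the default set of requests open to anonymous peers, and the premise that every validly signed peer is trusted. Only the rule that "noop" never requires a signature and the 530, 531, and 532 codes are taken from this document.

```python
# Response codes from the 530 series, keyed by why the signature is unusable.
# The state names are hypothetical labels, not part of CIP.
REJECTION_CODES = {"missing": 530, "invalid": 531, "uncheckable": 532}


def authorize(request, signature_state, anonymous_ok=frozenset({"noop", "poll"})):
    """Return None to accept a request, or a 530-series code to reject it."""
    if signature_state == "valid":
        return None  # assumption: every validly signed peer is trusted
    if signature_state not in REJECTION_CODES:
        raise ValueError("unknown signature state: %r" % signature_state)
    # Without a valid signature the request is treated as anonymous.
    if request == "noop" or request in anonymous_ok:
        return None  # "noop" must never require a signature
    return REJECTION_CODES[signature_state]
```

A server administrator who wants to refuse anonymous polls would pass an empty "anonymous_ok" set; the "noop" request would still be accepted.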
The explicit security policy with respect to incoming requests is outside the scope of this specification. Implementors are free to accept or reject any request based on the security attributes of the incoming message. When a request is rejected due to authentication reasons, a response code from the 530 series must be issued.

Acknowledgments

Thanks to the many helpful members of the FIND working group for discussions leading to this specification.

Author's Address

   Jeff R. Allen
   Bunyip Information Systems, Inc.
   310 Ste-Catherine West, Suite 300
   Montreal, Quebec H2X 2A1
   Canada

   Phone: +1-514-875-8611
   EMail: jeff@bunyip.com

Appendix A: Glossary

application domain: A problem domain to which CIP is applied which has indexing requirements which are not subsumed by any existing problem domain. Separate application domains require separate index object specifications, and potentially separate CIP meshes. See index object specification.

centroid: An index object type used with Whois++. In CIP versions before version 3, the index was not extensible, and could only take the form of a centroid. A centroid is a list of (template name, attribute name, token) tuples with duplicates removed.

dataset: A collection of data (real or virtual) over which an index is created. When a CIP server aggregates two or more indices, the resultant index represents the index from a "virtual dataset", spanning the previous two datasets.

Dataset Identifier: An identifier chosen from any part of the ISO/CCITT OID space which uniquely identifies a given dataset among all datasets indexed by CIP.

DSI: See Dataset Identifier.

DSI-description: A human readable string optionally carried along with DSIs to make them more user-friendly. See Dataset Identifier.

index object: The embodiment of the indices passed by CIP. An index object consists of some control attributes and an opaque payload.
index object specification: A document describing an index object type for use with the CIP system described in this document. See index object and payload.

index pushing: The act of presenting, unsolicited, an index to a peer CIP server.

MIME: See Multipurpose Internet Mail Extensions.

Multipurpose Internet Mail Extensions: A set of rules for encoding Internet Mail messages that gives them richer structure. CIP uses MIME rules to simplify object encoding issues. MIME is specified in RFC-1521 and RFC-1522.

payload: The application domain specific indexing information stored inside an index object. The format of the payload is specified externally to this document, and depends on the type of the containing index object.

PGP/MIME: A method of using PGP to provide security services for MIME objects via the use of Security Multiparts. See also Security Multiparts.

polled server: A CIP server which receives a request to generate and pass an index to a peer server.

polling server: A CIP server which generates a request to a peer server for its index.

referral chain: The set of referrals generated by the process of routing a query. See query routing.

query routing: Based on reference to indexing information, redirecting and replicating queries through a distributed database system towards the servers holding the actual results.

Security Multiparts: An architecture for securing MIME objects specified in RFC-1847. Security multiparts allow MIME objects to be signed and/or encrypted.

Appendix B: Response Codes

CIP response codes use Whois++ syntax due to backwards compatibility demands. The format is:

   response-code = "%" SP three-digits SP *comment-chars CR LF
   three-digits  = { 3 digit positive decimal integer }
   comment-chars = { all US-ASCII characters except for CR and LF }

Here SP, CR, and LF stand for the US-ASCII space (decimal 32), carriage return (decimal 13), and line feed (decimal 10) characters. Note that even in the case that there are zero comment characters, the 3 digit code must be followed by the characters SP CR LF.
In no case may the entire line exceed 255 characters.

Below are several examples (each line ends with CR LF, and the last example has a single space after the code):

   % 220 CIP Server v1.0 ready!
   % 500 MIME formatting problem
   % 500

Whois++ response codes allow continuation via a minus character ('-') in the sixth position. CIP response codes do not allow this feature. CIP clients which wish to maintain maximum interoperability with Whois++ should correctly handle these continuations during the version negotiation described in section 3.2.1 above. CIPv3 servers must never generate these multi-line response codes.

The meaning of the various digits in the response codes is discussed in RFC-821, Appendix E.

The following response codes are defined for use by CIPv3 servers. Implementors must use these exact codes; undefined codes should be interpreted by CIP servers as fatal protocol errors. Instead of defining new codes for unforeseen situations, implementors must adapt one of the given codes. The implementation should attach a useful alternative comment to the reused response code.

   Code    Suggested description text
           Sender-CIP action
   --------------------------------------------------------

   220     Initial server banner message
           Continue with Whois++ interaction, or attempt CIP version negotiation.

   300     Requested CIP version accepted
           Continue with CIP transaction, in the specified version.

   222     Connection closing (in response to sender-CIP close)
           Done with transaction.

   200     MIME request received and processed
           Expect no output, continue session (or close)

   201     MIME request received and processed, output follows
           Read a response, delimited by SMTP-style message delimiter.

   400     Temporarily unable to process request
           Retry at a later time. May be used to indicate that the server does not currently have the resources available to accept an index.

   500     Bad MIME message format
           Retry with correctly formatted MIME request.
   501     Unknown or missing request in application/cip-request
           Retry with correct CIP command.

   502     Request is missing required CIP attributes
           Retry with correct CIP attributes.

   520     Aborting connection for some unexpected reason
           Retry and/or alert local administrator.

   530     Request requires valid signature
           Sign the request, if possible, and retry. Otherwise, report problem to the administrator.

   531     Request has invalid signature
           Report problem to the administrator.

   532     Cannot check signature
           Alert local administrator, who should cooperate with remote administrator to diagnose and resolve the problem. (Probably missing a public key.)
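As a closing illustration, the single-line response code syntax described at the start of this appendix might be generated and checked as in the Python sketch below. It rests on assumptions that should be read as such: the separators are taken to be one space character each, the terminator is taken to be CR LF, the 255 character limit is assumed to include that terminator, and the function names are invented for the example.

```python
import re

# Assumed line syntax: "%", space, three digits, space, comment, CR LF.
# The comment may hold any US-ASCII character except CR and LF.
_LINE = re.compile(r"% ([0-9]{3}) ([\x00-\x09\x0b\x0c\x0e-\x7f]*)\r\n")
MAX_LINE = 255  # assumption: the limit counts the terminating CR LF


def parse_response(line):
    """Split a response line into (code, comment); reject anything else."""
    match = _LINE.fullmatch(line)
    if match is None or len(line) > MAX_LINE:
        raise ValueError("malformed CIP response code: %r" % line[:40])
    return int(match.group(1)), match.group(2)


def format_response(code, comment=""):
    """Build a single-line CIP response code from an integer and a comment."""
    line = "%% %03d %s\r\n" % (code, comment)
    parse_response(line)  # raises ValueError if the result is not well formed
    return line
```

Because the pattern demands a space in the sixth position, a Whois++ continuation line (with a minus character there) is rejected by this parser; a client that must interoperate with Whois++ during version negotiation would need to handle that case separately.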