INTERNET DRAFT                                    F. Giudici, A. Sappia
Category: Informational                      University of Genoa, Italy
                                                      February 22, 1997
                                                Expires August 22, 1997

          An Extension to the Web Robots Control Method
                   for supporting Mobile Agents

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress''.

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

1. Abstract

   The Web Robots Control Standard [1] is a method that allows
   administrators of sites on the World-Wide Web to give instructions
   to visiting Web Robots. This document describes an extension for
   supporting Robots based on Mobile Agents, in a way that is
   independent of the technology used for their actual implementation.

2. Introduction

   Web Robots are Web client programs that automatically traverse the
   World Wide Web by retrieving a document and recursively retrieving
   all documents that are referenced. Robots are used for maintenance,
   indexing and search purposes.

   ``Classic'' Robots perform their job from the host from which they
   have been launched; recent technologies make it possible to write
   Robots that can physically move through the network and operate
   directly on the site that hosts the data being processed.

Giudici Sappia        draft-giudici-web-robots-cntrl-00.txt     [Page 1]
   Mobile Robots can lead to savings in bandwidth and computational
   power, as well as to personalized search Robots. A more detailed
   discussion of the pros and cons of Mobile Robots is beyond the scope
   of this document.

   Mobile Agents [5] are a technology that, among other things, allows
   the implementation of Mobile Robots. Mobile Agents are a
   computational paradigm in which programs can ``migrate'' from host
   to host, preserving their current state. To migrate through the
   Internet, Mobile Agents have to transfer both their code and their
   internal data structures over the network. For this purpose, they
   need a communication protocol. To receive and execute a Mobile
   Agent, a host must be equipped with a proper daemon that listens on
   a port for incoming requests. Given the protocol name and the port
   number on which the daemon is listening, addresses of Mobile Agent
   destinations can be written in the form of a URL [2] as follows:

      <protocol>://<host>:<port>

   For instance, considering the Agent Transfer Protocol (ATP) [3] and
   given a fictional site www.fict.org, a valid address for dispatching
   a Mobile Agent could be

      atp://www.fict.org:434

3. Specification

   A method for controlling the way Robots can access a WWW site is
   already in use [1]. Briefly, the method states that a special
   document, named /robots.txt and served with MIME type text/plain,
   should be available at the root of the website. Referring to the
   previous example, the URL of this document would be

      http://www.fict.org/robots.txt

   /robots.txt contains a list of records that describe in detail
   which subtrees of the website are available for exploration by a
   given Robot and which are not. The format of these records is the
   following one:

      <field>: <value>

   A typical example follows:

      User-agent: webcrawler
      Allow: /
      Disallow: /reserved

   The method specification allows extensions to this structure, so new
   records can be added simply by defining new tokens.

3.1. The Mobile-agent-server record

   To control the dispatching of Mobile Robots, a new record type is
   defined with the following form (the formal syntax is described in
   the next section):

      Mobile-agent-server: <path> <url>

   Such a record associates a well defined path on the website with the
   URL of a host that accepts Mobile Robots for exploring that path.
   More than one Mobile-agent-server line can be used; in this case,
   later lines always override earlier ones. Using multiple lines makes
   it possible to assign different subtrees to different Mobile Agent
   capable hosts, or possibly to none. In the following example the
   website root (/) is not assigned to any host, while /dir1 and
   /dir1/dir2 are assigned to different targets:

      Mobile-agent-server: / none
      Mobile-agent-server: /dir1 atp://www.fict.org:544
      Mobile-agent-server: /dir1/dir2 atp://www.fict.org:543

   This mechanism is independent of both the protocol and the
   programming language used for implementing the Mobile Robot.

3.2. Formal Syntax

   This is a BNF-like description of the Mobile-agent-server record
   line, using the conventions of RFC 822 [4], except that "|" is used
   to designate alternatives. Briefly, literals are quoted with "",
   parentheses "(" and ")" are used to group elements, optional
   elements are enclosed in [brackets], and elements may be preceded
   with <n>* to designate n or more repetitions of the following
   element; n defaults to 0.

   The Mobile Robot extension defines a new record line as follows:

      mobileagentrec = "Mobile-agent-server:" *space path *space
                       ( simplified_url | "none" )

      simplified_url = scheme "://" net_loc

      scheme         = 1*( alpha | digit | "+" | "-" | "." )

      net_loc        = *( pchar | ";" | "?" )

      space          = 1*( SP | HT )

   The simplified URL is a special case of the URL defined in RFC 1808:
   it designates only a protocol, a network location and a port number.
   The syntax for "path" and the other symbols is defined in RFC 1808
   and is reproduced here for convenience:

      path       = fsegment *( "/" segment )
      fsegment   = 1*pchar
      segment    = *pchar
      pchar      = uchar | ":" | "@" | "&" | "="
      uchar      = unreserved | escape
      unreserved = alpha | digit | safe | extra
      escape     = "%" hex hex
      hex        = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                   "a" | "b" | "c" | "d" | "e" | "f"
      alpha      = lowalpha | hialpha
      lowalpha   = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
                   "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
                   "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
                   "y" | "z"
      hialpha    = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
                   "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
                   "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
                   "Y" | "Z"
      digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                   "8" | "9"
      safe       = "$" | "-" | "_" | "." | "+"
      extra      = "!" | "*" | "'" | "(" | ")" | ","

4. Examples

   This section contains an example of how an extended /robots.txt may
   be used. Let us suppose that a fictional site has the following
   URLs:

      http://www.fict.org/
      http://www.fict.org/index.html
      http://www.fict.org/services/
      http://www.fict.org/services/index.html
      http://www.fict.org/robots.txt
      http://www.fict.org/home/
      http://www.fict.org/home/user1/
      http://www.fict.org/home/user1/index.html
      http://www.fict.org/home/user2/
      http://www.fict.org/home/user2/index.html
      http://www.fict.org/home/user3/
      http://www.fict.org/home/user3/index.html

   Let user1.fict.org and user2.fict.org be two hosts equipped for
   receiving Mobile Agents, for example by means of the ATP protocol.
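   Before walking through the example, the grammar of Section 3.2 can
   be exercised mechanically. The following Python sketch is
   non-normative and purely illustrative (it is not part of the
   specification): it renders the BNF as a regular expression and
   checks candidate record lines against it. Note that it deliberately
   accepts the leading "/" used by the paths in the examples of this
   document, which the RFC 1808 "path" rule does not itself produce.

```python
import re

# Non-normative regular-expression rendering of the Section 3.2 grammar.
# The examples in this document write paths with a leading "/", which the
# RFC 1808 "path" rule does not produce; this sketch accepts it anyway.
PCHAR = r"(?:[A-Za-z0-9$\-_.+!*'(),:@&=]|%[0-9A-Fa-f]{2})"
PATH = rf"/(?:{PCHAR}+(?:/{PCHAR}*)*)?"   # "/", "/dir1", "/dir1/dir2", ...
SCHEME = r"[A-Za-z0-9+\-.]+"              # 1*( alpha | digit | "+" | "-" | "." )
NET_LOC = rf"(?:{PCHAR}|[;?])*"           # e.g. www.fict.org:434
RECORD = re.compile(
    rf"^Mobile-agent-server:[ \t]*({PATH})[ \t]+({SCHEME}://{NET_LOC}|none)$"
)

def is_valid_record(line):
    """Return True if a line matches this sketch of the record grammar."""
    return RECORD.match(line) is not None
```

   A robots.txt generator or a Mobile Robot could use such a check to
   reject malformed Mobile-agent-server lines before acting on them.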
   The /robots.txt contains Mobile Agent directives as follows:

      Mobile-agent-server: / atp://www.fict.org:8001
      Mobile-agent-server: /home/ none
      Mobile-agent-server: /home/user1/ atp://user1.fict.org:854
      Mobile-agent-server: /home/user2/ atp://user2.fict.org:831

   The following matrix shows whether Mobile Agents are supported for
   indexing a given document, and on which host:

      URL                                        HOST

      http://www.fict.org/index.html             atp://www.fict.org:8001
      http://www.fict.org/services/              atp://www.fict.org:8001
      http://www.fict.org/services/index.html    atp://www.fict.org:8001
      http://www.fict.org/robots.txt             atp://www.fict.org:8001
      http://www.fict.org/home/                  not available
      http://www.fict.org/home/user1/            atp://user1.fict.org:854
      http://www.fict.org/home/user1/index.html  atp://user1.fict.org:854
      http://www.fict.org/home/user2/            atp://user2.fict.org:831
      http://www.fict.org/home/user2/index.html  atp://user2.fict.org:831
      http://www.fict.org/home/user3/            not available
      http://www.fict.org/home/user3/index.html  not available

5. Security considerations

   The Mobile-agent-server record can expose the existence of resources
   that are not otherwise linked to on the site, which may help people
   guessing URLs. If the exposed resource is the URL of a document, no
   risks are induced beyond those already implied by the standard
   mechanism.

   If the exposed resource is the URL of a site that can host Mobile
   Agents, security has to be dealt with at that site by means of a
   proper security model, one that allows incoming Robots to perform
   only those operations needed for exploring the assigned website
   subtrees. However, this issue depends on the specific technology
   used for implementing the Mobile Robots, and it is not discussed
   here.

   The same considerations about impersonation and encryption stated
   in the Standard Specification also apply here.

6. References

   [1] Koster, M., "A Standard for Robot Exclusion",
       http://info.webcrawler.com/mak/projects/robot/norobots.html,
       June 1994.

   [2] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
       Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
       University of Minnesota, December 1994.

   [3] Lange, D. B., "Agent Transfer Protocol - ATP/0.1 Draft",
       IBM Tokyo Research Laboratory,
       http://www.trl.ibm.co.jp/aglets/atp/atp.htm, July 1996.

   [4] Crocker, D., "Standard for the Format of ARPA Internet Text
       Messages", STD 11, RFC 822, UDEL, August 1982.

   [5] Chang, D. T., and Lange, D. B., "Mobile Agents: A New Paradigm
       for Distributed Object Computing on the WWW", IBM Tokyo
       Research Laboratory, OOPSLA'96 Workshop ``Toward the
       integration of WWW and Distributed Object Technology'',
       http://www.trl.ibm.co.jp/aglets/atp/ma.html.

7. Authors' Addresses

   Fabrizio Giudici
   fritz@dibe.unige.it
   phone: +39-10-3532192

   Andrea Sappia
   sappia@dibe.unige.it
   phone: +39-10-3532192

   Electronic Systems and Networking Group
   Department of Biophysical and Electronic Engineering
   University of Genoa
   Via Opera Pia 11/a, 16145 - Genoa, ITALY

Expires August 22, 1997
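   As an informal companion to Sections 3.1 and 4, the following
   non-normative Python sketch (purely illustrative; not part of the
   specification) parses Mobile-agent-server lines and applies the
   override rule of Section 3.1: among all records whose path is a
   prefix of a document's path, the record appearing last in
   /robots.txt determines the agent server. It reproduces the matrix
   of Section 4 for the fictional site www.fict.org.

```python
# Illustrative sketch: parse Mobile-agent-server records and resolve a
# document path to its Mobile Agent server, per Section 3.1 semantics.

def parse_records(robots_txt):
    """Extract (path, server) pairs from Mobile-agent-server lines."""
    records = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line.lower().startswith("mobile-agent-server:"):
            continue
        fields = line.split(":", 1)[1].split()
        if len(fields) == 2:
            path, server = fields
            records.append((path, None if server == "none" else server))
    return records

def resolve(records, url_path):
    """Return the agent server for url_path, or None if not available."""
    result = None
    for path, server in records:
        if url_path.startswith(path):
            result = server   # later matching lines override earlier ones
    return result

# The directives from the example in Section 4.
ROBOTS = """\
Mobile-agent-server: / atp://www.fict.org:8001
Mobile-agent-server: /home/ none
Mobile-agent-server: /home/user1/ atp://user1.fict.org:854
Mobile-agent-server: /home/user2/ atp://user2.fict.org:831
"""

recs = parse_records(ROBOTS)
```

   With these records, resolve(recs, "/home/user2/index.html") yields
   atp://user2.fict.org:831, while /home/user3/ resolves to no server,
   matching the ``not available'' entries in the Section 4 matrix.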