INTERNET-DRAFT                                    N. Ballou (Microsoft)
Expires: December 1, 1997              B. Hernacki & B. Polk (Netscape)
                                                            May 1, 1997


                    NNTP Full-text Search Extension


1. Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as ``work in
   progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ds.internic.net (US East Coast),
   nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or
   munnari.oz.au (Pacific Rim).

2. Abstract

   This document describes a set of enhancements to the Network News
   Transfer Protocol [NNTP-977] that allow full-text searching of news
   articles in multiple newsgroups.  The proposed SEARCH command
   supports functionality similar to the [IMAP4] SEARCH command, minus
   user-specific search keys (i.e., ANSWERED, DRAFT, FLAGGED, KEYWORD,
   NEW, OLD, RECENT, SEEN) and minus search keys based on headers that
   do not exist in news (i.e., CC, BCC, TO).

   The availability of the extensions described here will be advertised
   by the server using the extension negotiation mechanism described in
   the new NNTP protocol specification currently being developed
   [NNTP-NEW].

3. Introduction

   The NNTP SEARCH command is sent from the client to the server to
   specify and initiate a full-text search on articles in one or more
   newsgroups.  It supports a subset of the [IMAP4] SEARCH command: the
   user-property and mail-specific header search keys are omitted.

   The result of an NNTP SEARCH is OVER data, as specified in
   [NNTP-NEW], for each article that satisfies the search criteria.

   In addition, the XPAT command is extended so that it can be used for
   full-text searches of articles within a single newsgroup.  Both the
   headers and the body of the articles are searched.

3.1. New and Enhanced NNTP Commands

   There is one new NNTP command, two new options to the existing LIST
   command, and an enhancement to one existing command.

      *  SEARCH
      *  LIST SRCHFIELDS
      *  LIST SEARCHABLE
      *  XPAT

   The SEARCH command runs a one-time search, returning overview-like
   data.

   The LIST SRCHFIELDS command returns the fields that the server
   allows in full-text searches.

   The LIST SEARCHABLE command allows the client to determine which
   newsgroups are full-text searchable.

   The XPAT command allows the pseudo-header ":TEXT".  This specifies a
   full-text (headers and body) search of the articles in a single
   newsgroup.

4. Use of NNTP Extension Mechanism

   The NNTP extension mechanism allows a server to describe its
   capabilities.  The following extensions are used to advertise the
   capabilities defined in this document.

4.1. SEARCH Extension

   The SEARCH extension means that the server supports the following
   commands: SEARCH, LIST SEARCHABLE, LIST SRCHFIELDS.

4.2. XPATTEXT Extension

   The XPATTEXT extension means that the server supports the :TEXT
   header in the XPAT command, as described by this document.

5. Command Descriptions

5.1. SEARCH Command

   Arguments:  optional character set specification
               optional newsgroup specification
               searching criteria (one or more)

   Responses:  224 overview information follows
               412 no newsgroup selected
               462 error performing search
               501 command syntax error
               502 no permission

   The SEARCH command searches one or more newsgroups for articles that
   match the given searching criteria.  Searching criteria consist of
   one or more search keys.

   If there are articles that match the search criteria, the server
   responds with code 224 and returns OVER data for each matching
   article in a format similar to that described in [NNTP-NEW].  The
   one change from the [NNTP-NEW] OVER format is in the article number
   field, which must support searches over multiple newsgroups: the
   article ID field for SEARCH OVER data uses the format
   newsgroup:art-ID rather than just an article number as defined in
   [NNTP-NEW].

   A response of 421 indicates that there are no articles that match
   the search criteria.  A response of 501 indicates a syntax error in
   the search criteria.  A response of 502 indicates that the user does
   not have permission to search one or more of the specified
   newsgroups.  If the search criteria did not specify a newsgroup, and
   there is no current newsgroup (i.e., one set using the NNTP GROUP
   command), then the server returns the error code 412, indicating
   that no newsgroup has been specified.  A response of 462 indicates
   that the server encountered an error when processing the search.

   When multiple keys are specified, the result is the intersection
   (AND function) of all the messages that match those keys.  For
   example, the criteria

      FROM "SMITH" SINCE 1-Feb-1994

   refers to all articles from Smith that were placed in the newsgroup
   since February 1, 1994.  A search key may also be a parenthesized
   list of one or more search keys (e.g., for use with the OR and NOT
   keys).

   Server implementations MAY exclude [MIME-1] body parts with terminal
   content types other than TEXT and MESSAGE from consideration in
   SEARCH matching.

   The optional character set specification consists of the word
   "CHARSET" followed by a registered MIME character set.  It indicates
   the character set of the strings that appear in the search criteria.
   [MIME-2] strings that appear in RFC 822/MIME message headers, and
   [MIME-1] content transfer encodings, MUST be decoded before
   matching.
   Except for US-ASCII, no particular character set is required to be
   supported.  If the server does not support the specified character
   set, a 462 error code is returned.

   The optional newsgroup specification consists of the word "IN"
   followed by either the wildcard character "*", indicating a search
   over all newsgroups, or a comma-separated list of newsgroup names.
   A newsgroup name can end with the wildcard string ".*", indicating a
   search over a sub-hierarchy of the newsgroup name space.  If no
   newsgroup specification is given, the search is over the current
   newsgroup.  If there is no current newsgroup, the server returns the
   412 error code.

   In all search keys that use strings, a message matches the key if
   the string is a substring of the field.  The matching is
   case-insensitive.

   The ON, BEFORE, and SINCE search criteria use the same date as the
   NNTP NEWNEWS command: the date the article arrived on the server.  A
   server indicates support for the ON, BEFORE, and SINCE search
   criteria by listing :Date in the LIST SRCHFIELDS response.

   The defined search keys are as follows.  Refer to the Formal Syntax
   section for the precise syntactic definitions of the arguments.

   range          Articles with article numbers corresponding to the
                  specified range.

   ALL            All articles in the current newsgroup; the default
                  initial key for ANDing.

   BEFORE         Articles whose server arrival date is earlier than
                  the specified date.

   BODY           Articles that contain the specified string in the
                  body of the message.

   FROM           Articles that contain the specified string in the
                  article structure's FROM field.

   HEADER         Articles that have a header with the specified
                  field-name (as defined in [RFC-822]) and that
                  contain the specified string in the [RFC-822]
                  field-body.

   LARGER         Articles with a size larger than the specified
                  number of octets.

   NOT            Articles that do not match the specified search key.

   ON             Articles whose server arrival date is within the
                  specified date.

   OR             Articles that match either search key.

   SENTBEFORE     Articles whose [RFC-822] Date: header is earlier
                  than the specified date.

   SENTON         Articles whose [RFC-822] Date: header is within the
                  specified date.

   SENTSINCE      Articles whose [RFC-822] Date: header is within or
                  later than the specified date.

   SINCE          Articles whose server arrival date is within or
                  later than the specified date.

   SMALLER        Articles with a size smaller than the specified
                  number of octets.

   SUBJECT        Articles that contain the specified string in the
                  envelope structure's SUBJECT field.

   TEXT           Articles that contain the specified string in the
                  header or body of the message.

   Example:

      C: SEARCH FROM "Smith" SINCE 1-Feb-1994
      S: 224 overview information follows
      S: comp.object:573 \t RE: object-oriented langs \t \
         "John Smith" \t Sun, 03 Nov 1996 \
         14:25:05 -0800 \t <01cbc9d5f3c70$eab9a2cd@xyz.com> \
         \t 4080 \t 33
      S: .

   Note: each field in the OVER response is separated by a tab, shown
   as \t in the example above.

5.1.1. Search Formal Syntax

   The search query syntax is derived from the search syntax defined
   for the IMAP4 protocol.  It is somewhat different because of the way
   international character sets need to be encoded.

   The following syntax specification uses the augmented Backus-Naur
   Form (BNF) notation as specified in [RFC-822].

   Except as noted otherwise, all alphabetic characters are
   case-insensitive.  The use of upper or lower case characters to
   define token strings is for editorial clarity only.  Implementations
   MUST accept these strings in a case-insensitive fashion.

   astring         ::= atom / string

   atom            ::= 1*ATOM_CHAR

   ATOM_CHAR       ::=

   atom_specials   ::= "(" / ")" / SPACE / CTL / "*" /
                       quoted_specials

   CHAR            ::=

   CTL             ::=

   date            ::= date_text / <"> date_text <">

   date_day        ::= 1*2digit
                       ;; Day of month

   date_month      ::= "Jan" / "Feb" / "Mar" / "Apr" / "May" / "Jun" /
                       "Jul" / "Aug" / "Sep" / "Oct" / "Nov" / "Dec"

   date_text       ::= date_day "-" date_month "-" date_year

   date_year       ::= 4digit

   digit           ::= "0" / digit_nz

   digit_nz        ::= "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" /
                       "9"

   header_fld_name ::= sstring

   mstring         ::= A MIME-2 encoded string surrounded by double
                       quotes

   newsgroup       ::= atom [ ".*" ]

   newsgroups      ::= "*" / newsgroup_list

   newsgroup_list  ::= newsgroup [ "," newsgroup_list ]

   number          ::= 1*digit
                       ;; Unsigned 32-bit integer
                       ;; (0 <= n < 4,294,967,296)

   nz_number       ::= digit_nz *digit
                       ;; Non-zero unsigned 32-bit integer
                       ;; (0 < n < 4,294,967,296)

   QUOTED_CHAR     ::= / "\" quoted_specials

   quoted_specials ::= <"> / "\"

   range           ::= nz_number / nz_number "-" [ nz_number ]
                       ;; Identifies a range of articles.

   search          ::= "SEARCH" SPACE ["CHARSET" SPACE astring SPACE]
                       ["IN" SPACE newsgroups SPACE] 1#search_key
                       ;; [CHARSET] MUST be registered with IANA

   search_key      ::= "ALL" /
                       "BODY" SPACE sstring /
                       "FROM" SPACE sstring /
                       "ON" SPACE date /
                       "SINCE" SPACE date /
                       "BEFORE" SPACE date /
                       "SUBJECT" SPACE sstring /
                       "TEXT" SPACE sstring /
                       "HEADER" SPACE header_fld_name SPACE sstring /
                       "LARGER" SPACE number /
                       "NOT" SPACE search_key /
                       "OR" SPACE search_key SPACE search_key /
                       "SENTBEFORE" SPACE date /
                       "SENTON" SPACE date /
                       "SENTSINCE" SPACE date /
                       "SMALLER" SPACE number /
                       range /
                       "(" 1#search_key ")"

   SPACE           ::= 1*

   sstring         ::= astring / mstring

   string          ::= <"> *QUOTED_CHAR <">

   TEXT_CHAR       ::=

5.2. LIST SRCHFIELDS Command

   Arguments:  none

   Responses:  224 data follows

   The LIST SRCHFIELDS command returns a list of the fields that can be
   specified in full-text search queries on the server.
   The response is a list of searchable fields, one per line.  A "." on
   its own line terminates the list.

   The fields are either newsgroup headers or non-header fields
   supported by the query syntax.  The three currently defined
   non-header fields are ":Body", ":Text", and ":Date".

   ":Text" means all the searchable text in the article, and indicates
   that the "text" keyword is supported in the search query language.

   ":Body" means the body of the article, excluding the headers, and
   indicates that the "body" keyword is supported in the search query
   language.

   ":Date" means the date at which an article arrived on a server,
   similar to the date used in the NNTP NEWNEWS command, and indicates
   that the "ON", "SINCE", and "BEFORE" keywords are supported in the
   search query language.

   The "date", "text" and "body" search query fields are optional, but
   the server must indicate whether they are supported or not in the
   LIST SRCHFIELDS response.

   Example:

      C: LIST SRCHFIELDS
      S: 224 Data follows.
      S: From
      S: Date
      S: Subject
      S: :Text
      S: .

5.3. LIST SEARCHABLE Command

   Arguments:  none

   Responses:  224 Data Follows

   The LIST SEARCHABLE command returns a list of strings that define
   which newsgroups are being indexed by the news server and are thus
   available for searching.  In addition, the character sets allowed
   for each group are returned.

   When there are indexed newsgroups, the server returns 224, followed
   by each portion of the tree that is indexed.  If all groups are
   indexed, a line with "*" is returned.  If only some parts of the
   newsgroup hierarchy are indexed, they are identified by a name
   followed by ".*" (for example, "comp.lang.*").  Clients should not
   assume that these will always be top-level hierarchies.  A "." on
   its own line terminates the list.

   The character sets allowed in full-text searches for each entry are
   also returned.  The character sets are identified by name as defined
   in [MIME-1].

   Example:

      C: LIST SEARCHABLE
      S: 224 Data follows.
      S: alt.* US-ASCII
      S: comp.lang.* US-ASCII ISO-8859-1 ISO-8859-2
      S: mcom.* ISO-8859-1
      S: .

5.4. XPAT Command Enhancement

   Arguments:  header range|<message-id> pat [pat...]

   Responses:

   The XPAT command is enhanced in a simple way: the new value ":TEXT"
   is supported as a header when invoking the command.  The :TEXT
   header requests a full-text search of the body and all headers of
   the specified articles.

   When :TEXT is specified for the header, only a single "pat" is
   allowed, and it must be a word or quoted string to search for,
   rather than a wildmat pattern as allowed otherwise.

   If :TEXT is not specified as the header, the response is the same as
   it always has been for XPAT, with each result line containing the
   article number and the value of the header that matched the pattern.
   If the :TEXT header is specified, the constant string "TEXT" is
   returned in place of the value of the header that matched the
   pattern.

   Example:

      C: XPAT :TEXT 1000-2000 searchtext
      S: 221 Header follows
      S: 1021 TEXT
      S: 1024 TEXT
      S: .

6. Security Considerations

   The search commands must be implemented in a way that does not allow
   access to articles in newsgroups that a client is otherwise
   restricted from reading due to access control rules.

7. Bibliography

   [NNTP-977]  Kantor, B., and P. Lapsley, Network News Transfer
               Protocol, RFC 977, February 1986.

   [NNTP-NEW]  Barber, S., Network News Transfer Protocol, Internet
               Draft, September 1996.

   [IMAP4]     Crispin, M., Internet Message Access Protocol - Version
               4, RFC 1730, December 1994.

   [MIME-1]    Borenstein, N., and N. Freed, MIME (Multipurpose
               Internet Mail Extensions) Part One: Mechanisms for
               Specifying and Describing the Format of Internet
               Message Bodies, RFC 1521, Bellcore, Innosoft, September
               1993.

   [MIME-2]    Moore, K., MIME (Multipurpose Internet Mail Extensions)
               Part Two: Message Header Extensions for Non-ASCII Text,
               RFC 1522, University of Tennessee, September 1993.

8. Author's Address

   Nat Ballou
   Microsoft
   One Microsoft Way
   Redmond, WA 98052
   USA

   Phone: +1 206-703-0574
   Email: natba@microsoft.com

   This Internet-Draft expires December 1, 1997.
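
   As an illustration of the SEARCH response format described in
   section 5.1, the following Python sketch shows how a client might
   split one line of a 224 response into its parts.  The names
   SearchHit and parse_search_line are hypothetical and are not defined
   by this document.  The sketch assumes that art-ID is a decimal
   article number and that the remaining OVER fields can be kept as
   opaque strings in the order the server sent them.

```python
from typing import NamedTuple


class SearchHit(NamedTuple):
    """One line of a 224 SEARCH response (hypothetical client-side type)."""
    newsgroup: str
    article_id: int
    fields: list[str]  # remaining OVER fields, in server order


def parse_search_line(line: str) -> SearchHit:
    """Split one tab-separated SEARCH response line into its parts."""
    first, *rest = line.rstrip("\r\n").split("\t")
    # The first field has the form newsgroup:art-ID (section 5.1).
    newsgroup, sep, art_id = first.strip().rpartition(":")
    # Assumption: art-ID is a decimal article number.
    if not sep or not newsgroup or not art_id.isdigit():
        raise ValueError(f"not a newsgroup:art-ID field: {first!r}")
    return SearchHit(newsgroup, int(art_id), [f.strip() for f in rest])
```

   A client would call parse_search_line on each line that follows the
   224 status line and stop at the line containing only ".".  Only the
   first field is interpreted; the subject and later fields are left
   untouched, so a colon inside a subject does not affect the split.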