Robots Exclusion Protocol

Internet-Draft	REP	May 2022
Koster, et al.	Expires 6 November 2022

Abstract

This document specifies and extends the "Robots Exclusion Protocol" method originally defined by Martijn Koster in 1996 for service owners to control how content served by their services may be accessed, if at all, by automatic clients known as crawlers.

2. Specification

2.1. Protocol Definition

The protocol language consists of rule(s) and group(s) that the service makes available in a file named 'robots.txt' as described in section 2.3:

Rule: A line with a key-value pair that defines how a crawler may access URIs. See section 2.2.2.
Group: One or more user-agent lines that are followed by one or more rules. The group is terminated by a user-agent line or end of file. See section 2.2.1. The last group may have no rules, which means it implicitly allows everything.

2.2. Formal Syntax

The following is a description of the syntax in Augmented Backus-Naur Form (ABNF), as defined in [RFC5234].


    robotstxt = *(group / emptyline)
    group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline)  ; ... and possibly more
                                          ; user-agents
           *(rule / emptyline)            ; followed by rules relevant
                                          ; for UAs

    startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

    rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

    ; parser implementors: add additional lines you need (for
    ; example, sitemaps), and be lenient when reading lines that don't
    ; conform. Apply Postel's law.

    product-token = identifier / "*"
    path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
    empty-pattern = *WS

    identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
    comment = "#" *(UTF8-char-noctl / WS / "#")
    emptyline = EOL
    EOL = *WS [comment] NL ; end-of-line may have
                           ; optional trailing comment
    NL = %x0D / %x0A / %x0D.0A
    WS = %x20 / %x09

    ; UTF8 derived from RFC3629, but excluding control characters

    UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
    UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
    UTF8-2 = %xC2-DF UTF8-tail
    UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
             %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
    UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
             %xF4 %x80-8F 2UTF8-tail

    UTF8-tail = %x80-BF
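
As an illustration of the grammar above, the following Python sketch splits a robots.txt body into groups. It is not part of the specification: the function name parse_robots, the regular expression, and the (user_agents, rules) tuple shape are assumptions made for this example. In keeping with the leniency the grammar comments ask for, it silently skips lines that do not conform.

```python
import re

# Hypothetical line pattern: key, ":", value, optional trailing "#" comment.
_LINE = re.compile(r"^[ \t]*([A-Za-z-]+)[ \t]*:[ \t]*([^#]*?)[ \t]*(?:#.*)?$")


def parse_robots(text):
    """Return a list of (user_agents, rules) groups.

    user_agents is a list of lowercased product tokens; rules is a list
    of ("allow" | "disallow", pattern) tuples in file order.
    """
    groups = []
    agents, rules = [], []
    # NL in the grammar is CR, LF, or CRLF.
    for line in re.split(r"\r\n|\r|\n", text):
        m = _LINE.match(line)
        if not m:
            continue  # blank, comment-only, or non-conforming line
        key, value = m.group(1).lower(), m.group(2)
        if key == "user-agent":
            if rules:
                # A user-agent line that follows rules starts a new group.
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif key in ("allow", "disallow"):
            if agents:  # rules that are not in any group are ignored
                rules.append((key, value))
        # Any other record (for example, sitemap) is skipped in this sketch.
    if agents:
        groups.append((agents, rules))
    return groups
```

Consecutive user-agent lines end up sharing one group, and a final group with no rules is kept with an empty rule list.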

2.2.1. The User-Agent Line

Each crawler sets a product token, which it uses to find the groups relevant to it. The product token MUST contain only "a-zA-Z_-" characters. The product token SHOULD be part of the identification string that the crawler sends to the service (for example, in the case of HTTP, the product name SHOULD be in the user-agent header). The identification string SHOULD describe the purpose of the crawler. The following example shows an HTTP header with a link to a page describing the purpose of the ExampleBot crawler; "ExampleBot" appears both in the HTTP header and as the product token:

Table 1: Example of a user-agent header and user-agent robots.txt token for ExampleBot
HTTP header	robots.txt user-agent line
user-agent: Mozilla/5.0 (compatible; ExampleBot/0.1; https://www.example.com/bot.html)	user-agent: ExampleBot

Crawlers MUST find the group that matches the product token exactly, and then obey the rules of the group. If there is more than one group matching the user-agent, the matching groups' rules MUST be combined into one group. The matching MUST be case-insensitive. If no matching group exists, crawlers MUST obey the first group with a user-agent line with a "*" value, if present. If no group satisfies either condition, or no groups are present at all, no rules apply.
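
The selection procedure above can be sketched in Python as follows. The function name rules_for and the representation of a group as a (user_agents, rules) tuple are assumptions made for this example, not part of the protocol.

```python
def rules_for(groups, product_token):
    """Pick the rules that apply to product_token.

    groups is a list of (user_agents, rules) tuples (a hypothetical
    shape); rules is a list of (kind, pattern) tuples.
    """
    token = product_token.lower()
    # Exact, case-insensitive match on the product token.
    matched = [
        rules
        for agents, rules in groups
        if any(agent.lower() == token for agent in agents)
    ]
    if matched:
        # Rules of all matching groups are combined into one group.
        return [rule for rules in matched for rule in rules]
    # Otherwise fall back to the first group with a "*" user-agent.
    for agents, rules in groups:
        if "*" in agents:
            return list(rules)
    # No applicable group: no rules apply.
    return []
```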

2.2.2. The Allow and Disallow Lines

These lines indicate whether accessing a URI that matches the corresponding path is allowed or disallowed.

To evaluate whether access to a URI is allowed, a robot MUST match the paths in allow and disallow rules against the URI. The matching SHOULD be case sensitive. The most specific match found MUST be used. The most specific match is the match that has the most octets. If an allow rule and a disallow rule are equivalent, the allow SHOULD be used. If no match is found amongst the rules in a group for a matching user-agent, or there are no rules in the group, the URI is allowed. The /robots.txt URI is implicitly allowed.
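
A minimal Python sketch of this evaluation is shown below. It uses plain prefix comparison only, so the special characters of section 2.2.3 are not interpreted, and it assumes that a rule with an empty pattern matches nothing. The function name is_allowed and the (kind, pattern) rule tuples are hypothetical.

```python
def is_allowed(rules, path):
    """Return True if path may be accessed under rules.

    rules is a list of ("allow" | "disallow", pattern) tuples that
    were already selected for the crawler's product token.
    """
    if path == "/robots.txt":
        return True  # implicitly allowed
    best_len, best_allow = -1, True  # no match means allowed
    for kind, pattern in rules:
        # Assumption: an empty pattern matches nothing.
        if not pattern or not path.startswith(pattern):
            continue
        length = len(pattern.encode("utf-8"))  # specificity in octets
        allow = kind == "allow"
        # The longest match wins; on a tie, allow is preferred.
        if length > best_len or (length == best_len and allow):
            best_len, best_allow = length, allow
    return best_allow
```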

Octets in the URI and robots.txt paths outside the range of the US-ASCII coded character set, and those in the reserved range defined by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior to comparison.

If a percent-encoded US-ASCII octet is encountered in the URI, it MUST be unencoded prior to comparison, unless it is a reserved character in the URI as defined by [RFC3986] or the character is outside the unreserved character range. The match evaluates positively if and only if the end of the path from the rule is reached before a difference in octets is encountered.

For example:

Table 2: Examples of matching percent-encoded URI components
Path	Encoded Path	Path to Match
/foo/bar?baz=quz	/foo/bar?baz=quz	/foo/bar?baz=quz
/foo/bar?baz=http://foo.bar	/foo/bar?baz=http%3A%2F%2Ffoo.bar	/foo/bar?baz=http%3A%2F%2Ffoo.bar
/foo/bar/ツ	/foo/bar/%E3%83%84	/foo/bar/%E3%83%84
/foo/bar/%E3%83%84	/foo/bar/%E3%83%84	/foo/bar/%E3%83%84
/foo/bar/%62%61%7A	/foo/bar/%62%61%7A	/foo/bar/baz
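
The following Python sketch shows one possible normalization that reproduces the rows of Table 2 that involve non-ASCII octets and percent-encoded unreserved characters. It is an illustration under stated assumptions, not a complete implementation: the name normalize_path is hypothetical, the unreserved set is the one defined in [RFC3986], retained escapes are upper-cased as an extra normalization step, and the sketch does not attempt to percent-encode reserved characters inside a query value (the second row of the table).

```python
import re

# Unreserved characters as defined in RFC 3986, section 2.3.
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_path(path):
    """Normalize a URI path or rule path before comparison."""
    # Step 1: percent-encode every octet outside US-ASCII.
    encoded = "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        for ch in path
    )

    # Step 2: decode escapes only when they stand for unreserved
    # characters; all other escapes stay encoded (upper-cased here).
    def decode(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in _UNRESERVED else match.group(0).upper()

    return _PCT.sub(decode, encoded)
```

Applying the same function to both the rule path and the URI path before comparing them keeps the two sides in the same form.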

The crawler SHOULD ignore "disallow" and "allow" rules that are not in any group (for example, any rule that precedes the first user-agent line).

Implementers MAY bridge encoding mismatches if they detect that the robots.txt file is not UTF-8 encoded.

2.2.3. Special Characters

Crawlers SHOULD allow the following special characters:

Table 3: List of special characters in robots.txt files
Character	Description	Example
"#"	Designates an end of line comment.	"allow: / # comment in line" "# comment on its own line"
"$"	Designates the end of the match pattern.	"allow: /this/path/exactly$"
"*"	Designates 0 or more instances of any character.	"allow: /this/*/exactly"

If crawlers need to match one of these special characters verbatim in the URI, crawlers SHOULD use "%" encoding. For example:

Table 4: Example of percent-encoding
Percent-encoded Pattern	URI
/path/file-with-a-%2A.html	https://www.example.com/path/file-with-a-*.html
/path/foo-%24	https://www.example.com/path/foo-$
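
One way to implement the "*" and "$" special characters is to translate a path pattern into a regular expression, as in the Python sketch below. The names pattern_to_regex and matches are hypothetical, and treating a "$" that is not the last character as a literal is an assumption of this example. Because only an unencoded "*" is treated as a wildcard, a percent-encoded "%2A" in a pattern is compared verbatim.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # "*" stands for zero or more of any character; everything else,
    # including a "$" that is not at the end, is matched literally.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing "$", the pattern only has to match a prefix.
    return re.compile(body + (r"\Z" if anchored else ""), re.DOTALL)


def matches(pattern, path):
    """Return True if path matches pattern from its first octet."""
    return pattern_to_regex(pattern).match(path) is not None
```

For instance, matches("/this/*/exactly", "/this/a/b/exactly/more") holds because "*" spans "a/b" and the pattern is not anchored, while matches("/this/path/exactly$", "/this/path/exactly/more") does not.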

2.2.4. Other Records

Clients MAY interpret other records that are not part of the robots.txt protocol, for example, 'sitemap' [SITEMAPS]. Parsing of other records MUST NOT interfere with the parsing of explicitly defined records in section 2.

2.3. Access Method

The rules MUST be accessible in a file named "/robots.txt" (all lower case) in the top level path of the service. The file MUST be UTF-8 encoded (as defined in [RFC3629]) and Internet Media Type "text/plain" (as defined in [RFC2046]).

As per [RFC3986], the URI of the robots.txt is:

"scheme:[//authority]/robots.txt"

For example, in the context of HTTP or FTP, the URI is:

          http://www.example.com/robots.txt

          https://www.example.com/robots.txt

          ftp://ftp.example.com/robots.txt
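
A crawler that holds an arbitrary URI on a service can derive the robots.txt location by keeping the scheme and authority and replacing everything else, as in this Python sketch. The function name robots_txt_uri is hypothetical; urlsplit and urlunsplit are from the Python standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_uri(uri):
    """Return the robots.txt URI for the service that serves uri."""
    parts = urlsplit(uri)
    # Keep scheme and authority; drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```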

2.3.1. Access Results

2.3.1.1. Successful Access

If the crawler successfully downloads the robots.txt, the crawler MUST follow the parseable rules.

2.3.1.2. Redirects

The server may respond to a robots.txt fetch request with a redirect, such as HTTP 301 and HTTP 302. Crawlers SHOULD follow at least five consecutive redirects, even across authorities (for example, hosts in case of HTTP), as defined in [RFC1945].

If a robots.txt file is reached within five consecutive redirects, the robots.txt file MUST be fetched, parsed, and its rules followed in the context of the initial authority.

If there are more than five consecutive redirects, crawlers MAY assume that the robots.txt is unavailable.
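
The redirect limit can be sketched in Python as below. To keep the example independent of any particular HTTP library, the transport is passed in as fetch, a hypothetical callable that returns a (status, location, body) tuple for a URI; the function name, the set of status codes treated as redirects, and the (None, None) return value for an exhausted limit are likewise assumptions of this example.

```python
from urllib.parse import urljoin

# Assumption: the HTTP status codes treated as redirects.
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def fetch_following_redirects(fetch, uri, max_redirects=5):
    """Fetch uri, following up to max_redirects consecutive redirects.

    Returns (status, body) of the final response, or (None, None) if
    the redirect limit was exceeded.
    """
    # One initial request plus at most max_redirects follow-ups.
    for _ in range(max_redirects + 1):
        status, location, body = fetch(uri)
        if status in REDIRECT_STATUSES and location:
            # The target may be relative and may be on another authority.
            uri = urljoin(uri, location)
            continue
        return status, body
    return None, None  # more than max_redirects consecutive redirects
```

The rules in the returned body still apply to the authority of the initial request, so a caller would store them under the original URI rather than the redirect target.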

2.3.1.3. Unavailable Status

Unavailable means the crawler tries to fetch the robots.txt, and the server responds with unavailable status codes. For example, in the context of HTTP, unavailable status codes are in the 400-499 range.

If a server status code indicates that the robots.txt file is unavailable to the client, then crawlers MAY access any resources on the server.

2.3.1.4. Unreachable Status

If the robots.txt is unreachable due to server or network errors, this means the robots.txt is undefined and the crawler MUST assume complete disallow. For example, in the context of HTTP, an unreachable robots.txt has a response code in the 500-599 range. For other undefined status codes, the crawler MUST assume the robots.txt is unreachable.

If the robots.txt is undefined for a reasonably long period of time (for example, 30 days), clients MAY assume the robots.txt is unavailable or continue to use a cached copy.
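
For HTTP, the access results of section 2.3.1 can be summarized as a mapping from the final status code to a crawl policy, sketched below in Python. The function name, the policy labels, the use of None for a network error, and the treatment of the 200-299 range as a successful download are assumptions of this example. The optional handling of a robots.txt that stays undefined for a long period is not modeled.

```python
def access_policy(status):
    """Map the final robots.txt fetch status to a crawl policy.

    status is an HTTP status code, or None if the fetch failed with
    a network error.
    """
    if status is not None and 200 <= status <= 299:
        return "parse"  # successful access: follow the parseable rules
    if status is not None and 400 <= status <= 499:
        return "allow_all"  # unavailable: any resource may be accessed
    # 500-599, network errors, and any other status: unreachable.
    return "disallow_all"
```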
