The protocol language consists of rule(s) and group(s) that the service makes available in a file named "robots.txt" as described in Section 2.3:

Rule:  A line with a key-value pair that defines how a crawler may access URIs. See Section 2.2.2.

Group:  One or more user-agent lines that are followed by one or more rules. The group is terminated by a user-agent line or end of file. See Section 2.2.1. The last group may have no rules, which means it implicitly allows everything.
Below is an Augmented Backus-Naur Form (ABNF) description, as described in [RFC 5234].
robotstxt = *(group / emptyline)
group = startgroupline ; We start with a user-agent
; line
*(startgroupline / emptyline) ; ... and possibly more
; user-agent lines
*(rule / emptyline) ; followed by rules relevant
; for the preceding
; user-agent lines
startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL
rule = *WS ("allow" / "disallow") *WS ":"
*WS (path-pattern / empty-pattern) EOL
; parser implementors: define additional lines you need (for
; example, Sitemaps).
product-token = identifier / "*"
path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
empty-pattern = *WS
identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
comment = "#" *(UTF8-char-noctl / WS / "#")
emptyline = EOL
EOL = *WS [comment] NL ; end-of-line may have
; optional trailing comment
NL = %x0D / %x0A / %x0D.0A
WS = %x20 / %x09
; UTF8 derived from RFC 3629, but excluding control characters
UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, "#"
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
%xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
%xF4 %x80-8F 2UTF8-tail
UTF8-tail = %x80-BF
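As a non-normative illustration of the grammar above, the following Python sketch splits a robots.txt line into a lowercased key and a value, discarding any trailing comment. The function name parse_line and its return convention are illustrative, not part of the protocol.

def parse_line(line):
    # Strip a trailing comment ("#" up to end of line) and surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if not line or ":" not in line:
        return None  # empty line or not a key-value pair
    key, value = line.split(":", 1)
    return key.strip().lower(), value.strip()

# Example: parse_line("Disallow: /example/page.html  # staging") returns
# ("disallow", "/example/page.html").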
Crawlers set their own name, which is called a product token, to find relevant groups. The product token
MUST contain only uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-"). The product token
SHOULD be a substring of the identification string that the crawler sends to the service. For example, in the case of HTTP [RFC 9110], the product token
SHOULD be a substring in the User-Agent header. The identification string
SHOULD describe the purpose of the crawler. Here's an example of a User-Agent HTTP request header with a link pointing to a page describing the purpose of the ExampleBot crawler, which appears as a substring in the User-Agent HTTP header and as a product token in the robots.txt user-agent line:
+==========================================+========================+
| User-Agent HTTP header | robots.txt user-agent |
| | line |
+==========================================+========================+
| User-Agent: Mozilla/5.0 (compatible; | user-agent: ExampleBot |
| ExampleBot/0.1; | |
| https://www.example.com/bot.html) | |
+------------------------------------------+------------------------+
Note that the product token (ExampleBot) is a substring of the User-Agent HTTP header.
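As a non-normative sketch, a crawler could check its product token against the character set defined above (letters, underscores, and hyphens); the regular expression and function name below are illustrative only.

import re

PRODUCT_TOKEN_RE = re.compile(r"^[A-Za-z_-]+$")  # letters, "_", and "-" only

def is_valid_product_token(token):
    return bool(PRODUCT_TOKEN_RE.match(token))

# Example: is_valid_product_token("ExampleBot") returns True.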
Crawlers
MUST use case-insensitive matching to find the group that matches the product token and then obey the rules of the group. If there is more than one group matching the user-agent, the matching groups' rules
MUST be combined into one group and parsed according to
Section 2.2.2.
+========================================+========================+
| Two groups that match the same product | Merged group |
| token exactly | |
+========================================+========================+
| user-agent: ExampleBot | user-agent: ExampleBot |
| disallow: /foo | disallow: /foo |
| disallow: /bar | disallow: /bar |
| | disallow: /baz |
| user-agent: ExampleBot | |
| disallow: /baz | |
+----------------------------------------+------------------------+
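A non-normative sketch of the merging step, assuming a parser has produced a list of groups as (user_agents, rules) tuples; the data layout and function names are illustrative only.

def matching_groups(groups, product_token):
    # Return every group whose user-agent line equals the product token,
    # compared case-insensitively.
    token = product_token.lower()
    return [group for group in groups
            if any(ua.lower() == token for ua in group[0])]

def merge_rules(matched):
    # Combine the rules of all matched groups into a single rule list.
    return [rule for _, rules in matched for rule in rules]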
If no matching group exists, crawlers
MUST obey the group with a user-agent line with the "*" value, if present.
+==================================+======================+
| Two groups that don't explicitly | Applicable group for |
| match ExampleBot | ExampleBot |
+==================================+======================+
| user-agent: * | user-agent: * |
| disallow: /foo | disallow: /foo |
| disallow: /bar | disallow: /bar |
| | |
| user-agent: BazBot | |
| disallow: /baz | |
+----------------------------------+----------------------+
If no group matches the product token and there is no group with a user-agent line with the "*" value, or no groups are present at all, no rules apply.
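Building on the sketch above, group selection with the "*" fallback might look as follows; again, this is illustrative rather than normative.

def applicable_rules(groups, product_token):
    # Prefer groups that name the product token; otherwise fall back to the
    # "*" group; if neither exists, no rules apply.
    matched = matching_groups(groups, product_token)
    if not matched:
        matched = matching_groups(groups, "*")
    return merge_rules(matched)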
The "allow" and "disallow" lines indicate whether accessing a URI that matches the corresponding path is allowed or disallowed.
To evaluate if access to a URI is allowed, a crawler
MUST match the paths in "allow" and "disallow" rules against the URI. The matching
SHOULD be case sensitive. The matching
MUST start with the first octet of the path. The most specific match found
MUST be used. The most specific match is the match that has the most octets. Duplicate rules in a group
MAY be deduplicated. If an "allow" rule and a "disallow" rule are equivalent, then the "allow" rule
SHOULD be used. If no match is found amongst the rules in a group for a matching user-agent or there are no rules in the group, the URI is allowed. The /robots.txt URI is implicitly allowed.
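A non-normative sketch of the longest-match evaluation described above, assuming rules is a list of (kind, path) tuples where kind is "allow" or "disallow", path is already normalized, and empty patterns have been dropped beforehand; handling of the "*" and "$" special characters is omitted here and shown later.

def is_allowed(rules, uri_path):
    if uri_path == "/robots.txt":
        return True  # the robots.txt URI is implicitly allowed
    best_kind = "allow"  # no match at all means the URI is allowed
    best_len = -1
    for kind, path in rules:
        # Case-sensitive match starting at the first octet of the path.
        if uri_path.startswith(path):
            if len(path) > best_len or (len(path) == best_len and kind == "allow"):
                best_kind, best_len = kind, len(path)
    return best_kind == "allow"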
Octets in the URI and robots.txt paths outside the range of the ASCII coded character set, and those in the reserved range defined by [RFC 3986], MUST be percent-encoded as defined by [RFC 3986] prior to comparison.
If a percent-encoded ASCII octet is encountered in the URI, it
MUST be unencoded prior to comparison, unless it is a reserved character in the URI as defined by [RFC 3986] or the character is outside the unreserved character range. The match evaluates positively if and only if the end of the path from the rule is reached before a difference in octets is encountered.
For example:
+==================+=======================+=======================+
| Path | Encoded Path | Path to Match |
+==================+=======================+=======================+
| /foo/bar?baz=quz | /foo/bar?baz=quz | /foo/bar?baz=quz |
+------------------+-----------------------+-----------------------+
| /foo/bar?baz= | /foo/bar?baz= | /foo/bar?baz= |
| https://foo.bar | https%3A%2F%2Ffoo.bar | https%3A%2F%2Ffoo.bar |
+------------------+-----------------------+-----------------------+
| /foo/bar/ | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84 |
| U+E38384 | | |
+------------------+-----------------------+-----------------------+
| /foo/ | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84 |
| bar/%E3%83%84 | | |
+------------------+-----------------------+-----------------------+
| /foo/ | /foo/bar/%62%61%7A | /foo/bar/baz |
| bar/%62%61%7A | | |
+------------------+-----------------------+-----------------------+
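A non-normative sketch of the normalization shown in the "Encoded Path" and "Path to Match" columns above: percent-escapes that stand for unreserved characters are decoded, escapes for reserved or non-ASCII octets are preserved, and any remaining raw octet outside the unreserved and reserved sets (including non-ASCII octets) is percent-encoded. The helper name normalize_path is illustrative only.

import re
from urllib.parse import quote

UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")
RESERVED = ":/?#[]@!$&'()*+,;="  # gen-delims and sub-delims from RFC 3986

def normalize_path(path):
    # Decode %XX escapes only when they stand for an unreserved character;
    # escapes for reserved or non-ASCII octets are kept as-is.
    def decode_unreserved(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    path = re.sub(r"%([0-9A-Fa-f]{2})", decode_unreserved, path)
    # Percent-encode any remaining octet that is neither unreserved,
    # reserved, nor part of an existing escape.
    return quote(path, safe=UNRESERVED + RESERVED + "%")

# Example: normalize_path("/foo/bar/%62%61%7A") returns "/foo/bar/baz".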
The crawler
SHOULD ignore "disallow" and "allow" rules that are not in any group (for example, any rule that precedes the first user-agent line).
Implementors
MAY bridge encoding mismatches if they detect that the robots.txt file is not UTF-8 encoded.
Crawlers
MUST support the following special characters:
+===========+===================+==============================+
| Character | Description | Example |
+===========+===================+==============================+
| # | Designates a line | allow: / # comment in line |
| | comment. | |
| | | # comment on its own line |
+-----------+-------------------+------------------------------+
| $ | Designates the | allow: /this/path/exactly$ |
| | end of the match | |
| | pattern. | |
+-----------+-------------------+------------------------------+
| * | Designates 0 or | allow: /this/*/exactly |
| | more instances of | |
| | any character. | |
+-----------+-------------------+------------------------------+
If crawlers match special characters verbatim in the URI, crawlers
SHOULD use "%" encoding. For example:
+============================+====================================+
| Percent-encoded Pattern | URI |
+============================+====================================+
| /path/file-with-a-%2A.html | https://www.example.com/path/ |
| | file-with-a-*.html |
+----------------------------+------------------------------------+
| /path/foo-%24 | https://www.example.com/path/foo-$ |
+----------------------------+------------------------------------+
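A non-normative sketch of matching with the special characters: a rule path is translated into a regular expression in which "*" matches any run of characters and a trailing "$" anchors the pattern at the end of the path. Literal "*" or "$" characters in the URI are expected to appear percent-encoded in the pattern, as in the table above; the function name is illustrative only.

import re

def pattern_to_regex(pattern):
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally except "*", which becomes ".*".
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + ".*".join(parts) + ("$" if anchored else ""))

# Example: pattern_to_regex("/this/*/exactly$") matches "/this/path/exactly"
# but not "/this/path/exactly/more".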
Crawlers
MAY interpret other records that are not part of the robots.txt protocol -- for example, "Sitemaps" [SITEMAPS]. Crawlers
MAY be lenient when interpreting other records. For example, crawlers may accept common misspellings of the record.
Parsing of other records
MUST NOT interfere with the parsing of explicitly defined records in
Section 2. For example, a "Sitemaps" record
MUST NOT terminate a group.
The rules
MUST be accessible in a file named "/robots.txt" (all lowercase) in the top-level path of the service. The file
MUST be UTF-8 encoded (as defined in [RFC 3629]) and Internet Media Type "text/plain" (as defined in [RFC 2046]).
As per [RFC 3986], the URI of the robots.txt file is:
"scheme:[//authority]/robots.txt"
For example, in the context of HTTP or FTP, the URI is:
https://www.example.com/robots.txt
ftp://ftp.example.com/robots.txt
If the crawler successfully downloads the robots.txt file, the crawler
MUST follow the parseable rules.
It's possible that a server responds to a robots.txt fetch request with a redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers
SHOULD follow at least five consecutive redirects, even across authorities (for example, hosts in the case of HTTP).
If a robots.txt file is reached within five consecutive redirects, the robots.txt file
MUST be fetched, parsed, and its rules followed in the context of the initial authority.
If there are more than five consecutive redirects, crawlers
MAY assume that the robots.txt file is unavailable.
"Unavailable" means the crawler tries to fetch the robots.txt file and the server responds with status codes indicating that the resource in question is unavailable. For example, in the context of HTTP, such status codes are in the 400-499 range.
If a server status code indicates that the robots.txt file is unavailable to the crawler, then the crawler
MAY access any resources on the server.
If the robots.txt file is unreachable due to server or network errors, this means the robots.txt file is undefined and the crawler
MUST assume complete disallow. For example, in the context of HTTP, server errors are identified by status codes in the 500-599 range.
If the robots.txt file is undefined for a reasonably long period of time (for example, 30 days), crawlers
MAY assume that the robots.txt file is unavailable as defined in
Section 2.3.1.3 or continue to use a cached copy.
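A non-normative sketch putting the redirect limit and the status-code handling above together. It assumes HTTP and Python's standard urllib; the policy strings ("parse", "allow-all", "disallow-all") and the helper names are illustrative only.

import urllib.error
import urllib.request
from urllib.parse import urljoin

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # surface redirects to the caller instead of following them

def fetch_robots_txt(url, max_redirects=5):
    opener = urllib.request.build_opener(_NoRedirect)
    for _ in range(max_redirects + 1):
        try:
            with opener.open(url) as resp:
                # Success: read at least the first 500 kibibytes for parsing.
                return ("parse", resp.read(500 * 1024))
        except urllib.error.HTTPError as err:
            location = err.headers.get("Location")
            if 300 <= err.code < 400 and location:
                url = urljoin(url, location)  # follow, even across authorities
                continue
            if 400 <= err.code < 500:
                return ("allow-all", b"")   # unavailable: access is not restricted
            return ("disallow-all", b"")    # server error: assume complete disallow
        except urllib.error.URLError:
            return ("disallow-all", b"")    # unreachable: assume complete disallow
    return ("allow-all", b"")  # more than five redirects: treat as unavailable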
Crawlers
MUST try to parse each line of the robots.txt file. Crawlers
MUST use the parseable rules.
Crawlers
MAY cache the fetched robots.txt file's contents. Crawlers
MAY use standard cache control as defined in [RFC 9111]. Crawlers
SHOULD NOT use the cached version for more than 24 hours, unless the robots.txt file is unreachable.
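A non-normative freshness check, assuming the crawler records the time at which each robots.txt copy was fetched; the names are illustrative only.

import time

MAX_CACHE_AGE = 24 * 60 * 60  # 24 hours, in seconds

def cached_copy_is_fresh(fetched_at, now=None):
    # The cached copy should not be used past 24 hours unless the
    # robots.txt file is unreachable.
    now = time.time() if now is None else now
    return (now - fetched_at) <= MAX_CACHE_AGE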
Crawlers
SHOULD impose a parsing limit to protect their systems; see
Section 3. The parsing limit
MUST be at least 500 kibibytes [KiB].
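As a non-normative sketch of the limit, a parser might process at least the first 500 kibibytes of the fetched file and ignore the rest; the constant and function names are illustrative.

PARSING_LIMIT = 500 * 1024  # 500 kibibytes, the minimum required limit

def truncate_to_limit(robots_txt_bytes):
    # Bytes beyond the parsing limit may be ignored to protect the parser.
    return robots_txt_bytes[:PARSING_LIMIT]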