When a crawler reads the User-Agent line of a robots.txt file, does it attempt to match it exactly to its own User-Agent or does it attempt to match it as a substring of its User-Agent?
Nothing I have read explicitly answers this question. According to another StackOverflow thread, it is an exact match.
However, the RFC draft makes me believe that it is a substring match. For example, User-Agent: Google will match "Googlebot" and "Googlebot-News". Here is the relevant quotation from the RFC:
The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
Additionally, the "Order of precedence for user-agents" section of Googlebot's documentation explains that the user agent for Google Images, "Googlebot-Image/1.0", matches User-Agent: googlebot.
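To make the competing readings concrete, here is a minimal Python sketch of the candidate rules. The function names are my own and purely illustrative; they are not taken from any crawler. I am also assuming that the comparison is case-insensitive. The third function follows the quoted RFC sentence word for word, which, if I read it correctly, tests the opposite direction from my "Google" example.

```python
def exact_match(line_value: str, robot_name: str) -> bool:
    """Reading 1: the User-Agent line must equal the robot's name."""
    # Assumption: names are compared case-insensitively.
    return line_value.lower() == robot_name.lower()


def value_in_robot_name(line_value: str, robot_name: str) -> bool:
    """Reading 2: the User-Agent line is a substring of the robot's name."""
    return line_value.lower() in robot_name.lower()


def robot_name_in_value(line_value: str, robot_name: str) -> bool:
    """Reading 3: the robot's name is a substring of the User-Agent line
    (the literal wording of the quoted sentence)."""
    return robot_name.lower() in line_value.lower()


# (User-Agent line value in robots.txt, robot's own name)
cases = [
    ("Google", "Googlebot"),
    ("googlebot", "Googlebot-Image"),
    ("Googlebot-News", "Googlebot"),
]

for line_value, robot_name in cases:
    print(
        f"{line_value!r} vs {robot_name!r}:",
        "exact =", exact_match(line_value, robot_name),
        "| line in name =", value_in_robot_name(line_value, robot_name),
        "| name in line =", robot_name_in_value(line_value, robot_name),
    )
```

The three rules disagree on every one of these pairs, which is why I would like to know which one crawlers actually implement.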
I would appreciate any clarity here; the answer may be more complicated than my question. For example, Eugene Kalinin's robots module for node mentions splitting the User-Agent to get the "name token" on line 29 and matching against that. If this is true, then Googlebot's User-Agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" will not match User-Agent: Googlebot.
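Here is a rough sketch of the concern, in Python. The splitting rule is an assumption on my part (take everything before the first "/"); I have not confirmed that this is what the module does, and name_token is a hypothetical helper of my own.

```python
def name_token(user_agent: str) -> str:
    """Hypothetical rule: the name token is the text before the first '/'."""
    return user_agent.split("/", 1)[0].strip()


full_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
image_ua = "Googlebot-Image/1.0"
line_value = "Googlebot"  # value of the User-Agent line in robots.txt

for ua in (full_ua, image_ua):
    token = name_token(ua)
    print("token:", repr(token))
    # Matching against the extracted token only
    print("  line value found in token:", line_value.lower() in token.lower())
    # Matching against the whole User-Agent header instead
    print("  line value found in full UA:", line_value.lower() in ua.lower())
```

Under this assumed rule the token for the full header is "Mozilla", so a token-based comparison misses Googlebot even though a substring search over the whole header would find it.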