I am curious about a website and want to do some web crawling at the /s path. Its robots.txt:
User-Agent: *
Allow: /$
Allow: /debug/
Allow: /qa/
Allow: /wiki/
Allow: /cgi-bin/loginpage
Disallow: /
My questions are:
If you follow the original robots.txt specification, $ has no special meaning, and there is no Allow field defined. A conforming bot would have to ignore fields it does not know, so such a bot would actually see this record:
User-Agent: *
Disallow: /
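To make the "original spec" reading concrete, here is a minimal sketch of a parser that only knows User-agent and Disallow and silently skips every other field. The function names (disallowed_prefixes, may_crawl_original) are made up for illustration, and the sketch assumes one User-agent line per record, which is all this file needs:

```python
ROBOTS_TXT = """\
User-Agent: *
Allow: /$
Allow: /debug/
Allow: /qa/
Allow: /wiki/
Allow: /cgi-bin/loginpage
Disallow: /
"""


def disallowed_prefixes(robots_txt, agent="*"):
    """Collect Disallow prefixes for `agent`, skipping unknown fields."""
    prefixes = []
    applies = False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Simplification: one User-agent line per record.
            applies = value == agent
        elif field == "disallow" and applies and value:
            prefixes.append(value)
        # Anything else (Allow, Sitemap, ...) is unknown here and ignored.
    return prefixes


def may_crawl_original(path, prefixes):
    """Original-spec matching: a plain prefix test, no wildcards."""
    return not any(path.startswith(prefix) for prefix in prefixes)


prefixes = disallowed_prefixes(ROBOTS_TXT)
print(prefixes)                            # ['/']
print(may_crawl_original("/s", prefixes))  # False
```

Every Allow line drops out, so the only rule left is the Disallow: / prefix, which matches every path.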
However, the original robots.txt specification has been extended by various parties. But as the authors of the robots.txt in question did not target a specific bot, we don’t know which "extension" they had in mind.
Typically (but not necessarily, as it’s not formally specified), Allow overrides rules specified in Disallow, and $ represents the end of the URL path.
Following this interpretation (it’s used by Google, for example), Allow: /$ would mean: you may crawl /, but you may not crawl /a, /b and so on.
So crawling of URLs whose path starts with /s would not be allowed (neither according to the original spec, thanks to Disallow: /, nor according to Google’s extension).
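The extended reading can be sketched in the same way. The code below assumes the matching behaviour Google documents: the longest matching pattern wins, Allow wins a tie, * matches any run of characters, and a trailing $ anchors the end of the path. RULES, pattern_to_regex and may_crawl are hypothetical names, not part of any library:

```python
import re

# The rules from the robots.txt in question, for User-Agent: *
RULES = [
    ("allow", "/$"),
    ("allow", "/debug/"),
    ("allow", "/qa/"),
    ("allow", "/wiki/"),
    ("allow", "/cgi-bin/loginpage"),
    ("disallow", "/"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def may_crawl(path, rules=RULES):
    """Longest matching pattern wins; Allow wins when lengths tie."""
    best = None  # (pattern length, is_allow)
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule at all means crawling is allowed.
    return True if best is None else best[1]


for path in ["/", "/s", "/a", "/wiki/Main"]:
    print(path, may_crawl(path))
# / True
# /s False
# /a False
# /wiki/Main True
```

For /s the only matching rule is Disallow: /, because Allow: /$ requires the path to end directly after the slash.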