It looks like there are two mainstream solutions for instructing crawlers what to index and what not to index: adding an X-Robot-Tag HTTP header, or providing a robots.txt.
Is there any advantage to using the former?
With robots.txt, you cannot disallow indexing of your documents.

They have different purposes:

- robots.txt can disallow crawling (with Disallow)
- X-Robots-Tag¹ can disallow indexing (with noindex)

(And both offer additional, different features, e.g., linking to your Sitemap in robots.txt, disallowing the following of links in X-Robots-Tag, and many more.)
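To make the distinction concrete: a robots.txt rule that disallows crawling might look like this (the /private/ path is only an illustration):

    User-agent: *
    # example path; adjust to your own site
    Disallow: /private/

and a response that disallows indexing carries the header:

    X-Robots-Tag: noindex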
Crawling means accessing the document. Indexing means providing a link to (and possibly metadata from or about) the document in an index. In the typical case, a bot indexes a document after having crawled it, but that’s not necessary.
A bot that isn’t allowed to crawl a document may still index it (without ever accessing it). A bot that isn’t allowed to index a document may still crawl it. You can’t disallow both: a bot that is blocked from crawling never requests the document, so it never sees a noindex sent with that document.
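How the X-Robots-Tag header gets added depends on your server; as a rough sketch, assuming Apache with mod_headers enabled, you could send it with every response like this:

    # illustrative only; assumes mod_headers is enabled
    Header set X-Robots-Tag "noindex"

(The nginx equivalent would be add_header X-Robots-Tag "noindex"; inside the relevant server or location block.)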
¹ Note that the header is called X-Robots-Tag, not X-Robot-Tag. By the way, the metadata name robots (for the HTML meta element) is an alternative to the HTTP header.
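For completeness, the meta-element equivalent of the noindex header goes in the document’s head:

    <!-- placed in the HTML head; only works for HTML documents -->
    <meta name="robots" content="noindex">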