My site has some particular pages that are:
- Already indexed in search engines, but I want to remove them from the indexes.
- Numerous, as they are dynamic (based on query string).
- A bit "heavy." (An overzealous bot can strain the server more than I'd like.)
Because of the second point (their sheer number), I'm just going to let them drop out of the indexes naturally, but I need to settle on a plan.
I started out by doing the following:
- Bots: Abort execution using user-agent detection in the application and send an essentially blank response. (I don't mind if some bots slip through and render the real page; I'm only blocking some common ones. A rough sketch of this handling follows the list.)
- Bots: Throw a 403 (forbidden) response code.
- All clients: Send "X-Robots-Tag: noindex" header.
- All clients: Added rel="nofollow" to the links that lead to these pages.
- Did not disallow those pages for bots in robots.txt. (I think disallowing is only useful either from the very beginning or after the pages have been completely removed from the indexes; otherwise, engines can't crawl those pages at all, so they never see the noindex header and never remove them. I mention this because robots.txt seems to be commonly misunderstood and tends to get suggested as an inappropriate silver bullet. The second sketch below walks through this.)
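For concreteness, here's roughly how the first three steps fit together. This is only a sketch: I'm assuming a Flask-style app, and the route, the bot list, and the render_heavy_page() helper are stand-ins for whatever the real application does.

```python
from flask import Flask, request, make_response

app = Flask(__name__)

# Illustrative list of common crawlers; the real check may differ.
BOT_MARKERS = ("googlebot", "bingbot", "slurp", "duckduckbot")

def looks_like_bot(user_agent: str) -> bool:
    """Crude user-agent sniffing; some bots will slip through, which is fine."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)

def render_heavy_page(query) -> str:
    # Placeholder for the real (expensive) page generation.
    return "<html><body>heavy dynamic page</body></html>"

@app.route("/heavy-page")
def heavy_page():
    if looks_like_bot(request.headers.get("User-Agent", "")):
        # Bots: abort early with a basically blank body and a 403.
        resp = make_response("", 403)
    else:
        # Real users still get the full page.
        resp = make_response(render_heavy_page(request.args))

    # All clients: ask engines not to index these URLs.
    resp.headers["X-Robots-Tag"] = "noindex"
    return resp
```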
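And here's the robots.txt reasoning in miniature, using the standard-library robotparser (the /heavy/ path and example.com URL are made up): a compliant crawler that is disallowed never fetches the page at all, so it never sees the noindex header.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Hypothetical rule I am deliberately NOT adding to robots.txt yet.
rp.parse([
    "User-agent: *",
    "Disallow: /heavy/",
])

# A well-behaved crawler would refuse to fetch the page, so it would never
# see the X-Robots-Tag: noindex response header on it.
print(rp.can_fetch("Googlebot", "https://example.com/heavy/?id=123"))  # -> False
```

That's why I'm only planning to add a Disallow, if at all, after the pages have actually dropped out of the indexes.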
However, since then, I think some of those steps were either fairly useless toward my goal, or actually problematic.
- I'm not sure if throwing a 403 to bots is a good idea. Do search engines see that and completely disregard the X-Robots-Tag? Would it be better to just serve them a 200?
- I think rel="nofollow" only potentially affects the target page's rank and doesn't affect crawling at all.
The rest of the plan seems okay (correct me if I'm wrong), but I'm not sure how the two points above fit into the grand scheme.