Controlling Search Engine Index Removals

Published 2019-09-18 06:21

My site has some particular pages that are:

  1. Already indexed in search engines, but I want to remove them from the indexes.
  2. Numerous, as they are dynamic (based on query string).
  3. A bit "heavy." (An overzealous bot can strain the server more than I'd like.)

Because of #2, I'm just going to let them drop out of the indexes gradually (removing them one by one isn't practical), but I need to settle on a plan for doing that correctly.

I started out by doing the following (a simplified code sketch follows the list):

  1. Bots: Abort execution using user-agent detection in the application, and send a basically blank response. (I don't mind if some bots slip through and render the real page, but I'm just blocking some common ones.)
  2. Bots: Throw a 403 (forbidden) response code.
  3. All clients: Send "X-Robots-Tag: noindex" header.
  4. All clients: Added rel="nofollow" to the links that lead to these pages.
  5. Did not disallow those pages in robots.txt. (Disallowing is only useful if you do it from the very beginning, or after the pages are completely removed from the indexes; otherwise, engines can't crawl the pages to discover and honor the noindex header, so the pages would never drop out. I mention this because robots.txt seems to be commonly misunderstood, and it tends to get suggested as an inappropriate silver bullet.)
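For concreteness, here's a minimal sketch of that setup, using Flask as a stand-in for whatever stack the site actually runs; the KNOWN_BOTS list, the is_bot() helper, and the /heavy path prefix are all placeholders invented for illustration:

```python
from flask import Flask, request, make_response

app = Flask(__name__)

KNOWN_BOTS = ("googlebot", "bingbot", "yandexbot")  # hypothetical list

def is_bot(user_agent: str) -> bool:
    # Crude substring sniffing; some bots will slip through, which is fine.
    ua = user_agent.lower()
    return any(bot in ua for bot in KNOWN_BOTS)

@app.before_request
def short_circuit_bots():
    # Steps 1 and 2: abort before any heavy work is done and answer
    # known bots with a 403 and a basically blank body.
    if request.path.startswith("/heavy") and is_bot(
        request.headers.get("User-Agent", "")
    ):
        return make_response("", 403)

@app.route("/heavy")
def heavy_page():
    # Stand-in for the expensive, query-string-driven page.
    return "expensive dynamic content"

@app.after_request
def add_noindex(response):
    # Step 3: every client, bot or human, gets the noindex header.
    if request.path.startswith("/heavy"):
        response.headers["X-Robots-Tag"] = "noindex"
    return response
```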

However, I've since come to suspect that some of those steps were either fairly useless toward my goal or actively problematic.

  • I'm not sure throwing a 403 at bots is a good idea. Do search engines see the 403 and disregard the X-Robots-Tag entirely? Would it be better to just respond with a 200 instead (see the sketch below)?
  • I now suspect rel="nofollow" only potentially affects how rank flows to the target page and doesn't prevent crawling at all, so it does little to get already-indexed pages removed.

The rest of the plan seems okay (correct me if I'm wrong), but I'm not sure about the above bullets in the grand scheme.
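If the 403 really does make engines skip the header, one variant suggested by the first bullet is to answer detected bots with a cheap 200 instead, so the noindex arrives on a "successful" response. This reuses the placeholder is_bot() helper and /heavy prefix from the sketch above:

```python
@app.before_request
def short_circuit_bots():
    # Drop-in replacement for the before_request hook in the earlier sketch.
    if request.path.startswith("/heavy") and is_bot(
        request.headers.get("User-Agent", "")
    ):
        # Near-blank 200: still cheap for the server, but the engine gets
        # a response whose X-Robots-Tag it should definitely honor.
        resp = make_response("", 200)
        resp.headers["X-Robots-Tag"] = "noindex"
        return resp
```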

1 Answer

迷人小祖宗 · answered 2019-09-18 06:32

I think this is a good plan; the one code change it implies is sketched after the list:

  1. Bots: Abort execution using user-agent detection in the application, and send a basically blank response. (I don't mind if some bots slip through and render the real page, but I'm just blocking some common ones.)
  2. Bots: Send a 410 (Gone) response code.
    "In general, sometimes webmasters get a little too caught up in the tiny little details and so if the page is gone, it's fine to serve a 404, if you know it's gone for real it's fine to serve a 410,"
    - http://goo.gl/AwJdEz
  3. All clients: Send the "X-Robots-Tag: noindex" header. This is probably redundant for the known bots that already got the 410, but it covers crawlers from engines the user-agent check doesn't recognize.
  4. All clients: Add rel="nofollow" to the links that lead to these pages. This probably isn't completely necessary, but it wouldn't hurt.
  5. Do not disallow those pages in robots.txt. (Same reasoning as in the question: a disallow rule would stop engines from re-crawling the pages, so they would never see the 410 or the noindex header, and the stale entries would linger in the index. Only add the disallow once the pages are completely removed from the indexes.)
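Code-wise, the only change this implies relative to the question's sketch is the status code, keeping the same placeholder Flask setup and is_bot() helper:

```python
@app.before_request
def short_circuit_bots():
    # Drop-in replacement for the before_request hook in the question's sketch.
    if request.path.startswith("/heavy") and is_bot(
        request.headers.get("User-Agent", "")
    ):
        # Step 2: 410 Gone signals permanent removal; the body stays
        # basically blank (step 1), so bots can't strain the server.
        return make_response("", 410)

# The after_request hook from the earlier sketch is unchanged: every
# client still receives "X-Robots-Tag: noindex" (step 3), which covers
# crawlers that the user-agent check doesn't recognize.
```

You can sanity-check what a crawler would see by fetching one of the pages with a bot-like User-Agent string and inspecting the status line and response headers.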