Dynamic robots.txt

Posted 2020-08-16 06:40

Question:

Let's say I have a web site for hosting community generated content that targets a very specific set of users. Now, let's say in the interest of fostering a better community I have an off-topic area where community members can post or talk about anything they want, regardless of the site's main theme.

Now, I want most of the content to get indexed by Google. The notable exception is the off-topic content. Each thread has its own page, but all the threads are listed in the same folder, so I can't just exclude search engines from a folder somewhere. It has to be per-page. A traditional robots.txt file would get huge, so how else could I accomplish this?

Answer 1:

This will work for all well-behaved search engines; just add it to the <head> of each page you want excluded:

<meta name="robots" content="noindex, nofollow" />
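Because the exclusion has to be per-page, the tag would normally be emitted by the page template only for off-topic threads. A minimal sketch in Python, assuming a hypothetical `is_off_topic` flag on each thread and hypothetical helper names (`robots_meta_tag`, `render_head`) that are not part of any framework:

```python
from html import escape


def robots_meta_tag(is_off_topic: bool) -> str:
    """Return the robots meta tag for off-topic threads, else an empty string.

    `is_off_topic` is a hypothetical flag; how a thread is classified
    depends on the site's own data model.
    """
    if not is_off_topic:
        return ""
    return '<meta name="robots" content="noindex, nofollow" />'


def render_head(title: str, is_off_topic: bool) -> str:
    """Build a minimal <head> section, adding the robots tag only when needed."""
    parts = [f"<title>{escape(title)}</title>"]
    tag = robots_meta_tag(is_off_topic)
    if tag:
        parts.append(tag)
    return "<head>\n" + "\n".join(parts) + "\n</head>"


if __name__ == "__main__":
    # An off-topic thread gets the tag; an on-topic one does not.
    print(render_head("Favorite pizza toppings", is_off_topic=True))
    print(render_head("Site-specific discussion", is_off_topic=False))
```

In a real site the same condition would live in whatever template engine renders the thread page.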


Answer 2:

If using Apache, I'd use mod_rewrite to alias robots.txt to a script that dynamically generates the necessary content.

Edit: If using IIS, you could use ISAPI_Rewrite to do the same.



Answer 3:

Similarly to @James Marshall's suggestion - in ASP.NET you could use an HttpHandler to route requests for robots.txt to a script that generates the content.



Answer 4:

You can implement this by replacing robots.txt with a dynamic script that generates the output. With Apache, a simple .htaccess rule achieves that:

RewriteRule  ^robots\.txt$ /robots.php [NC,L]
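The script behind such a rewrite only has to print a plain-text rule list. The answer points the rule at `/robots.php`; the sketch below shows the same idea in Python as a WSGI app instead. The function names and the hard-coded thread paths are assumptions for illustration; a real site would read the off-topic paths from its own database.

```python
def get_off_topic_paths():
    """Hypothetical lookup; stands in for a database query."""
    return ["/threads/1042", "/threads/1077"]


def build_robots_txt(disallowed_paths):
    """Return robots.txt text with one Disallow line per given path."""
    lines = ["User-agent: *"]
    paths = sorted(set(disallowed_paths))  # de-duplicate, stable order
    if not paths:
        # An empty Disallow value means nothing is blocked.
        lines.append("Disallow:")
    for path in paths:
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


def app(environ, start_response):
    """Minimal WSGI app that serves the generated rules as text/plain."""
    body = build_robots_txt(get_off_topic_paths()).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    print(build_robots_txt(get_off_topic_paths()), end="")
```

Generating the file does not make it smaller: with many off-topic threads the output is as long as a hand-written file, it is just kept up to date automatically.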


Answer 5:

For just that thread, make sure the page's <head> contains a noindex meta tag. That's another way to tell search engines to keep a page out of their index, besides blocking it in robots.txt.



Answer 6:

Just keep in mind that a robots.txt Disallow will NOT prevent Google from indexing pages that are linked from external sites; all it does is prevent Google from crawling those pages. See http://www.webmasterworld.com/google/4490125.htm or http://www.stonetemple.com/articles/interview-matt-cutts.shtml.



Answer 7:

You can stop search engines from indexing your content by using robots meta tags. Spiders that honor these instructions will index only the pages you want indexed.



Answer 8:

To block dynamic pages with robots.txt, use rules like these:


User-agent: *
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&



Tags: seo