Robots.txt: allow only major SE

Posted 2019-03-17 13:01

Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders?

4 answers
Melony?
#2 · 2019-03-17 13:36

As everyone knows, robots.txt is a voluntary standard, so only well-behaved crawlers obey it. Adding one does nothing to keep out agents that choose to ignore it.

If you have data that you do not want shown on the site at all, restrict access to it by changing its permissions; that is what actually improves security.

ら.Afraid
#3 · 2019-03-17 13:46

Why?

Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. So you're only going to be blocking legitimate search engines, as robots.txt compliance is voluntary.

But — if you insist on doing it anyway — that's what the User-Agent: line in robots.txt is for.

User-agent: googlebot
Disallow: 

User-agent: *
Disallow: /

With lines for all the other search engines you'd like traffic from, of course. Robotstxt.org has a partial list.
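To see how a compliant crawler would read those rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the `example.com` URL, and the `SomeOtherBot` agent are made up for illustration; real crawlers may match user agents slightly differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some/page"  # hypothetical URL
    for agent in ("Googlebot", "SomeOtherBot"):  # second name is hypothetical
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

This only predicts what a rule-following crawler would do; a bot that ignores robots.txt is unaffected.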

三岁会撩人
#4 · 2019-03-17 13:55

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: msnbot
Disallow: 

Slurp is Yahoo's robot

走好不送
#5 · 2019-03-17 13:56

There are more than three major search engines, depending on which country you are talking about. Facebook seems to do a good job of listing only legitimate ones: https://facebook.com/robots.txt

So your robots.txt can be something like:

User-agent: Applebot
Allow: /

User-agent: baiduspider
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Facebot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: msnbot
Allow: /

User-agent: Naverbot
Allow: /

User-agent: seznambot
Allow: /

User-agent: Slurp
Allow: /

User-agent: teoma
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: Yandex
Allow: /

User-agent: Yeti
Allow: /

User-agent: *
Disallow: /
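If the list keeps growing, it may be easier to generate the file than to maintain it by hand. The following is a rough sketch, not a definitive tool: `build_allowlist_robots_txt` and `can_crawl` are hypothetical helper names, `ALLOWED` is just a subset of the names above, and the check relies on Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def build_allowlist_robots_txt(allowed_agents):
    """Build robots.txt text that admits only the listed crawlers."""
    groups = [f"User-agent: {agent}\nAllow: /\n" for agent in allowed_agents]
    # Catch-all group last: every crawler not named above is asked to stay out.
    groups.append("User-agent: *\nDisallow: /\n")
    # Each group ends with a newline, so joining on "\n" leaves a blank line
    # between groups.
    return "\n".join(groups)


# A few names taken from the list above; extend as needed.
ALLOWED = ["Applebot", "Bingbot", "Googlebot", "Slurp", "Yandex"]
ROBOTS_TXT = build_allowlist_robots_txt(ALLOWED)


def can_crawl(user_agent, path="/"):
    """Check the generated rules the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(ROBOTS_TXT)
    print(can_crawl("Bingbot"), can_crawl("SomeOtherBot"))
```

Writing `ROBOTS_TXT` to the site root is left out here, since where that file lives depends on the server setup.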