Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders?
As everyone knows, robots.txt is a voluntary standard, so only well-behaved crawlers obey it. Adding one therefore does nothing to keep out the agents that ignore it.
If you have data that you do not want shown on the site, change its permissions and tighten security on the server rather than relying on robots.txt.
Why?
Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. So you're only going to be blocking legitimate search engines, as robots.txt compliance is voluntary.
But if you insist on doing it anyway, that's what the User-Agent: line in robots.txt is for. You would add lines for all the other search engines you'd like traffic from, of course. Robotstxt.org has a partial list.
Slurp is Yahoo's robot
There are more than 3 major search engines, depending on which country you are talking about. Facebook seems to do a good job of listing only legitimate ones: https://facebook.com/robots.txt
So your robots.txt can name each crawler you want to admit and disallow everything else.
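As a minimal sketch of that idea, the Python snippet below embeds one possible robots.txt and checks it with the standard library's urllib.robotparser. The user-agent tokens are assumptions apart from Slurp, which is named above: Googlebot for Google, and msnbot plus its successor bingbot for MSN. Check each engine's own documentation before relying on them. The helper name build_parser and the agent EvilScraper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt: an empty "Disallow:" allows everything for the
# named crawler, while the final "*" group disallows the whole site for
# every other agent. Tokens other than Slurp are assumptions.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # "EvilScraper" is a hypothetical agent that matches no named group.
    for agent in ("Googlebot", "Slurp", "msnbot", "bingbot", "EvilScraper"):
        verdict = "allowed" if parser.can_fetch(agent, "/index.html") else "blocked"
        print(agent, verdict)
```

This only shows what a compliant crawler would conclude from the file. As the answers above point out, an agent that ignores robots.txt is not stopped by it.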