From the HTTP server's perspective.
You can read the official Verifying Googlebot page.
Quoting the page here:
If you're using the Apache web server, you can look at the access log file, 'log\access.log'.
Then load Google's IPs from http://www.iplists.com/nw/google.txt and check whether any of them appears in your log.
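The log check described above could be sketched roughly as follows in Python. This is only an illustration under a few assumptions: the IP list is plain text with one address per line, each log line starts with the client IP (as in Apache's common and combined log formats), and the function names `parse_ip_list` and `find_listed_clients` are hypothetical helpers, not part of any library. The sample lines at the bottom are made up.

```python
import ipaddress


def parse_ip_list(lines):
    """Collect valid IP addresses from text lines (assumed: one address per line)."""
    ips = set()
    for line in lines:
        token = line.strip()
        if not token or token.startswith("#"):
            continue  # skip blank lines and comment lines
        try:
            ips.add(ipaddress.ip_address(token))
        except ValueError:
            continue  # skip anything that is not a plain IP address
    return ips


def find_listed_clients(log_lines, known_ips):
    """Return the known IPs that appear as the client address in the log."""
    seen = set()
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if not parts:
            continue
        try:
            # Assumption: the first field of each log line is the client IP.
            client = ipaddress.ip_address(parts[0])
        except ValueError:
            continue
        if client in known_ips:
            seen.add(client)
    return seen


if __name__ == "__main__":
    # Made-up sample data; in practice, read the downloaded list and the log file.
    known = parse_ip_list(["66.249.71.1", "66.249.71.2"])
    log = [
        '66.249.71.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.1" 200 2326',
        '203.0.113.5 - - [10/Oct/2000:13:55:40 -0700] "GET / HTTP/1.1" 200 2326',
    ]
    for ip in sorted(find_listed_clients(log, known)):
        print(ip)
```

To run it against real files, pass open file objects instead of the sample lists, for example `parse_ip_list(open("google.txt"))`.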
I have captured Google crawler requests in my ASP.NET application, and here's what the crawler's signature looks like.
My logs show many different IPs for the Google crawler in the 66.249.71.* range. All of these IPs are geolocated in Mountain View, CA, USA.
A nice way to check whether a request is coming from the Google crawler is to verify that the request contains "Googlebot" and "http://www.google.com/bot.html". As I said, many IPs show up for the same requesting client, so I would not recommend checking IPs. That is where client identity (the User-Agent header) comes into the picture, so verify the client identity instead.
Here's some sample code in C#.
It's important to note that any HTTP client can easily fake this.
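As a minimal sketch of the client-identity check described above, written in Python rather than C#: the function name `looks_like_googlebot` is a hypothetical helper, and it assumes the caller has already pulled the User-Agent header value out of the request. It only tests for the two markers mentioned above, so it is a first-pass filter at best.

```python
def looks_like_googlebot(user_agent):
    """Return True if the User-Agent value contains both Googlebot markers."""
    if not user_agent:
        return False  # missing or empty header
    ua = user_agent.lower()
    # Both substrings named in the answer must be present.
    return "googlebot" in ua and "http://www.google.com/bot.html" in ua


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(looks_like_googlebot(sample))
    print(looks_like_googlebot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```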