Google's Webmaster Guidelines state:
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.
My ASP.NET 1.1 site uses custom authentication/authorization and relies pretty heavily on session GUIDs (similar to this approach). I'm worried that allowing non-session-tracked traffic will either break my existing code or introduce security vulnerabilities.
What best practices are there for allowing non-session-tracked bots to crawl a normally session-tracked site? And are there any ways of detecting search bots other than inspecting the user agent (I don't want people to spoof themselves as Googlebot to get around my session tracking)?
The correct way to detect bots is by host entry (Dns.GetHostEntry). Some lame robots require you to track by IP address, but the popular ones generally don't. Googlebot requests come from *.googlebot.com. After you get the host entry, you should check the IPHostEntry.AddressList to make sure it contains the original IP address.
Do not even look at the user agent when verifying robots.
See also http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
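For reference, here is roughly what that check could look like in C#. This is only a sketch: the BotVerifier class and method names are my own, the accepted host suffixes are limited to Googlebot's documented domains, and Dns.GetHostEntry is the .NET 2.0+ API (on 1.1 the closest equivalents are, if I remember correctly, Dns.GetHostByAddress and Dns.GetHostByName). You would also want to cache the result per IP, since these lookups block.

```csharp
using System;
using System.Net;

// Hypothetical helper: verify a claimed Googlebot with a reverse DNS lookup,
// then a forward lookup that must resolve back to the original IP.
public static class BotVerifier
{
    public static bool IsVerifiedGooglebot(string remoteIp)
    {
        try
        {
            // Reverse lookup: a real Googlebot IP resolves to a host
            // under googlebot.com (or google.com).
            string host = Dns.GetHostEntry(remoteIp).HostName.ToLower();
            if (!host.EndsWith(".googlebot.com") && !host.EndsWith(".google.com"))
                return false;

            // Forward lookup: the host name must resolve back to the original
            // IP, otherwise anyone who controls their own PTR record could
            // claim to be *.googlebot.com.
            IPAddress original = IPAddress.Parse(remoteIp);
            foreach (IPAddress addr in Dns.GetHostEntry(host).AddressList)
            {
                if (addr.Equals(original))
                    return true;
            }
        }
        catch (Exception)
        {
            // Missing PTR records or DNS timeouts: treat as "not a verified bot".
        }
        return false;
    }
}
```

In an ASP.NET page or handler you would call it as BotVerifier.IsVerifiedGooglebot(Request.UserHostAddress).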
First of all: we had some issues with simply stripping JSESSIONIDs from responses to known search engines. Most notably, creating a new session for each request caused OutOfMemoryErrors (you're not using Java, but keeping state for thousands of active sessions is a problem for most, if not all, servers/frameworks). This might be solved by reducing the session timeout (for bot sessions only, if possible), so if you'd like to go down this path, be warned. And if you do, there's no need for DNS lookups: you aren't protecting anything valuable here (compared to Google's First Click Free, for instance). If somebody pretends to be a bot, that should normally be fine.
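In your ASP.NET case, the "shorter timeout for bot sessions" idea could look roughly like the sketch below. As said, a plain User-Agent check is enough for this purpose, since the worst a spoofer gains is a shorter session. The bot names and the 2/20 minute values are just examples.

```csharp
// Global.asax.cs -- a rough sketch, not production code.
protected void Session_Start(object sender, EventArgs e)
{
    string ua = Request.UserAgent == null ? "" : Request.UserAgent.ToLower();
    bool looksLikeBot = ua.IndexOf("googlebot") != -1 || ua.IndexOf("bingbot") != -1;

    // Bots never send the session cookie back, so every hit spins up a new
    // session; expiring bot sessions quickly keeps thousands of one-request
    // sessions from accumulating in memory.
    Session.Timeout = looksLikeBot ? 2 : 20;
}
```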
Instead, I'd rather suggest keeping session tracking (using URL parameters as a fallback for cookies) and adding a canonical link tag (<link rel="canonical" href="..." />, obviously without the session ID itself) to each page. See "Make Google Ignore JSESSIONID" or an extensive video featuring Matt Cutts for discussion. Adding this tag isn't very intrusive and could possibly be considered good practice anyway. So basically you would end up without any dedicated handling of search engine spiders - which certainly is a Good Thing (tm).
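One possible way to emit that tag from ASP.NET is a shared base page, sketched below. This assumes .NET 2.0+ (HtmlLink and Page.Header), a <head runat="server"> in your markup, and that the session ID travels as a query-string parameter; if it's embedded in the path (as with cookieless ASP.NET sessions), you'd have to strip that segment yourself, and on 1.1 you'd render the literal <link> tag instead.

```csharp
using System;
using System.Web.UI;
using System.Web.UI.HtmlControls;

// Hypothetical base page: every page inheriting from it gets a canonical link
// pointing at its session-free URL.
public class CanonicalPage : Page
{
    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);

        // Scheme + host + path only: drops the query string, and with it any
        // session ID carried as a URL parameter.
        string cleanUrl = Request.Url.GetLeftPart(UriPartial.Path);

        HtmlLink canonical = new HtmlLink();
        canonical.Href = cleanUrl;
        canonical.Attributes["rel"] = "canonical";
        Header.Controls.Add(canonical);
    }
}
```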
I believe your approach to the problem is not quite right. You shouldn't rely on the session tracking mechanism to decide on access rights, to log malicious users, to detect bots, etc.
If you don't want arbitrary users to access certain pages, you should use authentication and authorization. If arbitrary users are allowed to access a page at all, they should be allowed to do it without any session ID (as if it were the first page they visit) - so bots will also be able to crawl these pages without any problems.
Malicious users could most likely circumvent your session tracking by not using (or tweaking) cookies, referrers, URL parameters, etc. So session tracking cannot be relied on here; just do plain logging of every request with its originating IP. Later you can analyze the collected data to detect suspicious activity, try to find users with multiple IPs, etc. This analysis is complex and should not be done at runtime.
To detect bots, you could do a reverse DNS lookup on the collected IPs. Again, this can be done offline, so there is no performance penalty. Generally, the content of the page served should not depend on whether the visitor is a bot or an unauthenticated human user (search engines rightfully treat such behaviour as cheating).
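An offline pass over such a log could be as simple as the console sketch below. The file name and one-IP-per-line format are just assumptions about how you collect the data; if you then grant anything based on the result, the forward-lookup confirmation from the first answer still applies.

```csharp
using System;
using System.IO;
using System.Net;

// Offline analysis sketch: read logged client IPs and resolve them after the
// fact, so DNS lookups never slow down live requests.
class ClassifyLoggedIps
{
    static void Main()
    {
        foreach (string line in File.ReadAllLines("request-ips.log"))
        {
            string ip = line.Trim();
            if (ip.Length == 0) continue;

            string host;
            try
            {
                host = Dns.GetHostEntry(ip).HostName;
            }
            catch (Exception)
            {
                host = "(no reverse DNS)";
            }
            Console.WriteLine("{0}\t{1}", ip, host);
        }
    }
}
```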
If spoofing is your main concern, you're doing security wrong. You shouldn't give robots any more permissions than users - quite the opposite (which is why users get logins and robots get robots.txt).
If you're going to give someone special privileges without authentication, it is inherently open to spoofing. IPs can be spoofed. Server-client communication can be spoofed. And so on.
If you rely on tracking cookies to analyse malicious behaviour, you need to fix that. It should be easy enough to get a good understanding without requesting that the malicious user identify him/herself.
IPs aren't a good substitute for authentication, but they are good enough for grouping if cookies aren't available. Besides, you should be using more reliable means (i.e. a combination of factors) in the first place.