Programmatic Bot Detection

Published: 2019-01-21 17:56

Question:

I need to write some code to analyze whether or not a given user on our site is a bot. If it is a bot, we'll take some specific action. Looking at the User-Agent only works for friendly bots, since a bot can claim any user agent it wants. I'm after the behaviors of unfriendly bots. Ideas I've had so far:

  • No browser ID
  • No session ID
  • Unable to write a cookie (see the sketch below)

Obviously, there are some cases where a legitimate user will look like a bot, but that's okay. Are there other programmatic ways to detect a bot, or at least detect something that looks like a bot?
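
For the cookie check in particular, I was picturing something along these lines (the endpoint name is made up, and the exact flow would depend on our stack):

<script type="text/javascript">
// Try to write a cookie and read it back. A client that runs JS and can
// persist the cookie pings the (hypothetical) endpoint; a bot that does
// neither never hits this URL at all.
document.cookie = "cookiecheck=1; path=/";
if (document.cookie.indexOf("cookiecheck=1") !== -1) {
    new Image().src = "/cookie-check.php?ok=1";
}
</script>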

Answer 1:

Clarify why you want to exclude bots, and how tolerant you are of mis-classification.

That is, do you have to exclude every single bot at the expense of treating real users like bots? Or is it okay if bots crawl your site as long as they don't have a performance impact?

The only way to exclude all bots is to shut down your web site. A malicious user can distribute their bot to enough machines that you would not be able to distinguish their traffic from real users. Tricks like JavaScript and CSS will not stop a determined attacker.

If a "happy medium" is satisfactory, one trick that might be helpful is to hide links with CSS so that they are not visible to users in a browser, but are still in the HTML. Any agent that follows one of these "poison" links is a bot.



Answer 2:

User agents can be faked. Captchas have been cracked. Valid cookies can be sent back to your server with page requests. Legitimate programs, such as Adobe Acrobat Pro, can download your entire web site in one session. Users can disable JavaScript. Since there is no standard measure of "normal" user behaviour, you cannot reliably differentiate a real user from a bot.

In other words: it can't be done, short of pulling the user into some form of interactive chat and hoping they pass the Turing test. And even then, they could be a really good bot.



Answer 3:

You say it is okay that some users appear as bots, so:

Most bots don't run JavaScript. Use JavaScript to make an Ajax-style call to the server that identifies this IP address as NonBot. Store that for a set period of time so future connections from this IP are treated as good clients, and to avoid further wasteful JavaScript calls.
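
A rough sketch of that call (the endpoint name is made up; the server side just needs to record the IP/session and expire the flag after a while):

<script type="text/javascript">
// Only clients that execute JavaScript will ever make this request,
// so the server can treat the requesting IP/session as "NonBot".
var xhr = new XMLHttpRequest();
xhr.open("GET", "/mark-not-a-bot.php", true);
xhr.send(null);
</script>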



Answer 4:

A simple test is to use JavaScript:

<script type="text/javascript">
// Write out a hidden tracking pixel; only clients that actually execute
// JavaScript will ever request it.
document.write('<img src="/not-a-bot.' + 'php" style="display: none;">');
</script>

not-a-bot.php can then add something to the session to flag that the user is not a bot, and return a single-pixel GIF.

The URL is broken up to disguise it from the bot.



Answer 5:

Here's an idea:

Most bots don't download CSS, JavaScript, or images; they just parse the HTML.

If you keep track in a user's session of whether they download all of the above, e.g. by routing those requests through a script that logs the attempts, you can quickly identify users who only download the raw HTML (very few normal users will do this).
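
Purely as an illustration of the routing idea (Node/Express is used here as a stand-in for whatever server stack you actually run, and all names are made up):

// Every static asset is served through a handler that marks the session
// as having fetched non-HTML resources; sessions that never do this after
// a few page views are likely bots.
const express = require("express");
const session = require("express-session");

const app = express();
app.use(session({ secret: "change-me", resave: false, saveUninitialized: true }));

app.use("/assets",
    (req, res, next) => {
        req.session.fetchedAssets = true;   // behaves like a real browser
        next();
    },
    express.static("public"));

app.get("/", (req, res) => {
    req.session.pageViews = (req.session.pageViews || 0) + 1;
    const suspect = req.session.pageViews > 3 && !req.session.fetchedAssets;
    res.send('<link rel="stylesheet" href="/assets/site.css">' +
             (suspect ? "<!-- treat as bot -->" : "") + "<p>Hello</p>");
});

app.listen(3000);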



Answer 6:

For each session on the server, you can determine whether the user was at any point clicking or typing implausibly fast. After a given number of repeats, set an "isRobot" flag to true and conserve resources within that session. Normally you wouldn't tell the user they've been detected as a robot, since they would just start a new session in that case.
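
A sketch of that per-session timing check (Express-style again, purely as an illustration; the thresholds are arbitrary):

const express = require("express");
const session = require("express-session");

const app = express();
app.use(session({ secret: "change-me", resave: false, saveUninitialized: true }));

// Flag sessions whose requests arrive faster than a human could click.
app.use((req, res, next) => {
    const now = Date.now();
    const gap = now - (req.session.lastRequestAt || 0);
    req.session.lastRequestAt = now;

    if (gap < 200) {                                // two hits under 200 ms apart
        req.session.fastHits = (req.session.fastHits || 0) + 1;
        if (req.session.fastHits > 5) {
            req.session.isRobot = true;             // throttle quietly, don't announce it
        }
    }
    next();
});

app.listen(3000);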



Answer 7:

Well, this is really for one particular page of the site. We don't want a bot submitting the form because it messes up tracking. Honestly, the friendly bots (Google, Yahoo, etc.) aren't a problem, as they don't typically fill out the form to begin with. If we suspected someone of being a bot, we might show them a CAPTCHA image or something like that... If they passed, they're not a bot and the form submits...

I've heard of things like putting the form in Flash, or requiring JavaScript to submit, but I'd prefer not to get in the way of real users until I actually suspect they're a bot...



Answer 8:

I think your idea of checking the session ID will already be quite useful.

Another idea: You could check whether embedded resources are downloaded as well.

A bot that does not load images (e.g. to save time and bandwidth) should be distinguishable from a browser, which will typically load the images embedded in a page.

Such a check, however, might not be suited to real time, because you would have to analyze some sort of server log, which could be time-consuming.



Answer 9:

Hey, thanks for all the responses. I think a combination of a few suggestions will work well: mainly the hidden form element that times how fast the form was filled out, and possibly the "poison link" idea. I think that will cover most bases. When you're talking about bots, you're not going to catch them all, so there's no point thinking that you will... Silly bots.
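
Roughly what I have in mind for the timed hidden field (names are made up, and the server-side comparison isn't shown; storing the render time in the session instead of trusting a client value would be harder to forge):

<form action="/submit.php" method="post">
    <!-- Timestamp written by JS when the page renders; the server flags
         submissions that arrive implausibly soon afterwards, or that have
         no timestamp at all (meaning no JS ran). -->
    <input type="hidden" name="form_rendered_at" id="form_rendered_at" value="">
    <script type="text/javascript">
    document.getElementById("form_rendered_at").value = new Date().getTime();
    </script>

    <input type="text" name="email">
    <input type="submit" value="Send">
</form>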



Tags: bots