How to protect/monitor your site from crawling by malicious users

Posted 2019-03-17 07:13

Situation:

  • Site with content protected by username/password (not all accounts are controlled, since some may be trial/test users)
  • A normal search engine can't reach the content because of the username/password restrictions
  • A malicious user can still log in and hand the session cookie to a "wget -r" or a similar tool.

The question is: what is the best way to monitor this kind of activity and respond to it, given that the site policy is that no crawling/scraping is allowed?

I can think of some options:

  1. Set up some traffic monitoring solution to limit the number of requests for a given user/IP.
  2. Related to the first point: automatically block some user-agents.
  3. (Evil :)) Set up a hidden link that, when accessed, logs the user out and disables their account. (Presumably a normal user would never see it to click it, but a bot will crawl all links.)

For point 1: do you know of a good, already-implemented solution? Any experience with it? One problem is that false positives could show up for very active but human users.
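
To make point 1 concrete, a simple request counter could look roughly like the servlet filter below, registered in web.xml in front of the protected pages. This is only a sketch: the class name, the thresholds, and the choice of keying by client IP are placeholders, and a real version would need to evict stale entries and tune the limits against actual traffic.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Counts requests per client IP in a fixed one-minute window and rejects
    // clients that go over the limit. The map grows without bound here; a real
    // version would evict stale entries and make the counts visible for review.
    public class RateLimitFilter implements Filter {

        private static final int MAX_REQUESTS = 300;      // per window; tune against real traffic
        private static final long WINDOW_MILLIS = 60000L; // one minute

        private final Map<String, long[]> windows = new HashMap<String, long[]>();

        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            // Keying by IP; keying by the logged-in username instead would catch
            // users who rotate IPs but reuse the same account.
            String key = ((HttpServletRequest) request).getRemoteAddr();
            long now = System.currentTimeMillis();

            boolean overLimit;
            synchronized (windows) {
                long[] w = windows.get(key);
                if (w == null || now - w[0] > WINDOW_MILLIS) {
                    w = new long[] { now, 0 };  // w[0] = window start, w[1] = request count
                    windows.put(key, w);
                }
                w[1]++;
                overLimit = w[1] > MAX_REQUESTS;
            }

            if (overLimit) {
                ((HttpServletResponse) response).sendError(429); // Too Many Requests
                return;
            }
            chain.doFilter(request, response);
        }

        public void init(FilterConfig config) {}
        public void destroy() {}
    }

Since false positives for very fast human readers are exactly the risk, it might be safer to start by only logging and alerting when the threshold is crossed, and block outright only after reviewing the logs.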

For point 3: do you think this is really evil? Or do you see any possible problems with it?
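
For illustration, point 3 could be a link hidden from human users, e.g. <a href="/trap" style="display: none"></a>, with a servlet mapped to that URL in web.xml. The sketch below makes assumptions that are not given above: the "username" session attribute and the account-service hook are stand-ins for whatever the real login code uses.

    import java.io.IOException;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    // Handles a trap URL that is only reachable through a link hidden from human
    // users. A crawler blindly following every link will hit it; a person should not.
    public class TrapServlet extends HttpServlet {

        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            HttpSession session = req.getSession(false);
            if (session != null) {
                // Assumes the login code stores the username in the session under this key.
                String username = (String) session.getAttribute("username");
                log("Trap URL hit by " + username + " from " + req.getRemoteAddr());

                // Hypothetical hook into the user store; flagging the account for review
                // may be safer than disabling it outright.
                // accountService.flagForReview(username);

                session.invalidate(); // log the client out immediately
            }
            resp.sendError(HttpServletResponse.SC_FORBIDDEN);
        }
    }

One caveat: link prefetchers and some browser extensions also follow links the user never clicked, so disabling an account on a single hit could hurt legitimate users, and an automatic lock-out also tells the scraper exactly which sensor they tripped.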

Other suggestions are also welcome.

9 answers
Anthone
#2 · 2019-03-17 07:40

I would not recommend automatic lock-outs, not so much because they are necessarily evil, but because they provide immediate feedback to the malicious user that they tripped a sensor, and let them know not to do the same thing with the next account they sign up with.

And user-agent blocking is probably not going to be very helpful, because obviously user-agents are very easy to fake.

About the best you can probably do is monitoring, but then you still have to ask what you're going to do if you detect malicious behavior. As long as you have uncontrolled access, anyone you lock out can just sign up again under a different identity. I don't know what kind of info you require to get an account, but just a name and e-mail address, for instance, isn't going to be much of a hurdle for anybody.

It's the classic DRM problem -- if anyone can see the information, then anyone can do anything else they want with it. You can make it difficult, but ultimately if someone is really determined, you can't stop them, and you risk interfering with legitimate users and hurting your business.

够拽才男人
#3 · 2019-03-17 07:42

Added comments:

  • I know you can't completely protect something that a normal user should be able to see. I've been on both sides of the problem :)
  • From the developer's side, what do you think is the best ratio of time spent versus cases protected? I'd guess some simple user-agent checks (a minimal sketch follows this list) would weed out half or more of the potential crawlers, while you could spend months developing protection against the last 1%.
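
For what it's worth, a minimal version of such a check could be a servlet filter along these lines. The class name and token list are made up, and the same rule could just as well be applied in the Apache layer in front of the application:

    import java.io.IOException;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Rejects requests whose User-Agent header contains a known tool/crawler token.
    // The list is illustrative and trivially bypassed by changing the header, so this
    // is only a cheap first layer, not real protection.
    public class UserAgentFilter implements Filter {

        private static final String[] BLOCKED = { "wget", "curl", "libwww", "httpclient", "python" };

        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            String ua = ((HttpServletRequest) request).getHeader("User-Agent");
            String lower = (ua == null) ? "" : ua.toLowerCase();
            for (String token : BLOCKED) {
                if (lower.contains(token)) {
                    ((HttpServletResponse) response).sendError(HttpServletResponse.SC_FORBIDDEN);
                    return;
                }
            }
            chain.doFilter(request, response);
        }

        public void init(FilterConfig config) {}
        public void destroy() {}
    }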

Again, from a service-provider point of view, I'm also interested in making sure one user (crawler) doesn't consume CPU/bandwidth at the expense of others, so are there any good bandwidth/request limiters you can point out?

Response to a comment, with the platform specifications: the application is based on JBoss Seam running on JBoss AS, with an apache2 instance in front of it (running on Linux).

戒情不戒烟
#4 · 2019-03-17 07:44

The problem with option 3 is that the auto-logout would be trivial to avoid once the scraper figures out what is going on.
