How to protect/monitor your site from crawling by

Posted 2019-03-17 07:11

Question:

Situation:

  • The site's content is protected by username/password (not all accounts are under our control, since they can be trial/test users)
  • A normal search engine can't get at it because of the username/password restrictions
  • A malicious user can still log in and pass the session cookie to "wget -r" or something similar.

The question is: what is the best way to monitor such activity and respond to it, given that the site policy forbids crawling and scraping?

I can think of some options:

  1. Set up some traffic monitoring solution to limit the number of requests for a given user/IP.
  2. Related to the first point: Automatically block some user-agents
  3. (Evil :)) Set up a hidden link that when accessed logs out the user and disables his account. (Presumably this would not be accessed by a normal user since he wouldn't see it to click it, but a bot will crawl all links.)

For point 1: do you know of a good, already-implemented solution? Any experience with it? One problem is that very active but human users might show up as false positives.
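To make point 1 concrete, here is a minimal sketch of a per-user (or per-IP) request limiter using a sliding time window. It is not an existing product, just an illustration; the class name, the limits, and the injectable `clock` parameter are all assumptions made for the example. The false-positive concern above shows up directly in the choice of `max_requests` and `window_seconds`.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Counts requests per key (user ID or IP) inside a moving time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # key -> timestamps of recent allowed requests

    def allow(self, key):
        """Return True if this request is within the limit, False otherwise."""
        now = self.clock()
        window = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True
```

A request handler would call something like `limiter.allow(username)` on each request and decide what to do on `False`: reject, slow down, or just log it for later review.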

For point 3: do you think this is really evil? Or do you see any possible problems with it?

Also accepting other suggestions.

Answer 1:

Point 1 has the problem you have mentioned yourself. Also it doesn't help against a slower crawl of the site, or if it does then it may be even worse for legitimate heavy users.

You could turn point 2 around and only allow the user-agents you trust. Of course this won't help against a tool that fakes a standard user-agent.

A variation on point 3 would just be to send a notification to the site owners, then they can decide what to do with that user.

Similarly for my variation on point 2, you could make this a softer action, and just notify that somebody is accessing the site with a weird user agent.
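A rough sketch of the "soft" allowlist idea: check the User-Agent against a short list of trusted patterns and notify someone instead of blocking. The patterns and the `notify` callback are hypothetical and only for illustration; real browser strings vary a lot, and as noted above any tool can fake one.

```python
import re

# Hypothetical allowlist for illustration only; a real list would need
# to be much broader and maintained over time.
TRUSTED_UA_PATTERNS = [
    re.compile(r"^Mozilla/\d\.\d \("),
    re.compile(r"^Opera/\d"),
]


def looks_like_browser(user_agent):
    """Return True if the User-Agent matches one of the trusted patterns."""
    if not user_agent:
        return False
    return any(p.search(user_agent) for p in TRUSTED_UA_PATTERNS)


def flag_unusual_agent(username, user_agent, notify):
    """Report, rather than block, a request whose User-Agent is not on the list.

    `notify` is any callable taking a message string (e.g. a logger method).
    Returns True if the request was flagged.
    """
    if looks_like_browser(user_agent):
        return False
    notify(f"{username}: unusual User-Agent {user_agent!r}")
    return True
```

The request itself is still served either way; the only effect is the notification, so a mistaken match costs the site owner a glance at a log line rather than costing a legitimate user their access.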

Edit: Related to this, I once had a weird issue when I was accessing a URL of my own that was not public (I was staging a site that I hadn't announced or linked anywhere). Although nobody but me should have known this URL, all of a sudden I noticed hits in the logs. When I tracked this down, I saw they came from a content filtering site. It turned out that my mobile ISP used a third party to block content, and it intercepted my own requests. Since it didn't know the site, it then fetched the page I was trying to access and (I assume) did some keyword analysis in order to decide whether or not to block it. This kind of thing might be an edge case you need to watch out for.



Answer 2:

I would not recommend automatic lock-outs, not so much because they are necessarily evil, but because they provide immediate feedback to the malicious user that they tripped a sensor, and let them know not to do the same thing with the next account they sign up with.

And user-agent blocking is probably not going to be very helpful, because obviously user-agents are very easy to fake.

About the best you can probably do is monitoring, but then you still have to ask what you're going to do if you detect malicious behavior. As long as you have uncontrolled access, anyone you lock out can just sign up again under a different identity. I don't know what kind of info you require to get an account, but just a name and e-mail address, for instance, isn't going to be much of a hurdle for anybody.

It's the classic DRM problem -- if anyone can see the information, then anyone can do anything else they want with it. You can make it difficult, but ultimately if someone is really determined, you can't stop them, and you risk interfering with legitimate users and hurting your business.



Answer 3:

It depends on what kind of malicious user we are talking about.

If they know how to use wget, they can probably set up Tor and get a new IP address every time, slowly copying everything you have. I don't think you can prevent that without inconveniencing your (paying?) users.

It is the same as DRM on games, music, and video. If the end user is supposed to see something, you cannot protect it.



Answer 4:

Short answer: it can't be done reliably.

You can go a long way by simply blocking IP addresses that cause more than a certain number of hits in some time frame (some web servers support this out of the box, others require modules, or you can do it by parsing your log file and, e.g., using iptables), but you need to take care not to block the major search engine crawlers and large ISPs' proxies.
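As a rough sketch of the log-parsing route, the function below counts hits per IP in fixed time buckets and returns the addresses that exceed a threshold. It assumes the access log starts each line in Common Log Format (`IP - - [17/Mar/2019:07:11:00 +0000] ...`); the function name, the threshold, and the allowlist parameter are made up for the example. Feeding the result to a firewall is left out on purpose.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumes Common Log Format: client IP first, timestamp in square brackets.
LOG_LINE = re.compile(r"^(\S+) \S+ \S+ \[([^\]]+)\]")
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def busy_ips(lines, threshold, window=timedelta(minutes=1), allowlist=frozenset()):
    """Return the set of IPs with more than `threshold` hits in any one bucket.

    Buckets are fixed, non-overlapping intervals of length `window`.
    IPs in `allowlist` (e.g. known crawlers or ISP proxies) are never reported.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, stamp = match.groups()
        if ip in allowlist:
            continue
        when = datetime.strptime(stamp, TIME_FORMAT)
        bucket = int(when.timestamp() // window.total_seconds())
        counts[(ip, bucket)] += 1
    return {ip for (ip, _bucket), n in counts.items() if n > threshold}
```

Fixed buckets are the simplest thing that works; a burst straddling a bucket boundary can slip under the threshold, so a sliding window would be stricter at the cost of a bit more bookkeeping.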



Answer 5:

The problem with option 3 is that the auto-logout would be trivial to avoid once the scraper figures out what is going on.



Answer 6:

@frankodwyer:

  • Allowing only trusted user agents won't work. Consider especially the IE user-agent string, which gets modified by add-ons or the .NET version. There would be too many possibilities, and it can be faked anyway.
  • The variation on point 3 with a notification to the admin would probably work, but it would mean an indeterminate delay if an admin isn't monitoring the logs constantly.

@Greg Hewgill:

  • The auto-logout would also disable the user account. At the least, a new account would have to be created, leaving more traces such as an email address and other information.

Randomly changing the logout/disable URL for point 3 would be interesting, but I don't know how I would implement it yet :)
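One possible way to get a changing trap URL, sketched here purely as an assumption and not as a tested design: derive the hidden link from the session ID with an HMAC, so every session sees a different path and nothing needs to be stored. The secret key, the `/articles/` prefix, and both function names are hypothetical.

```python
import hashlib
import hmac

# Assumption: a server-side secret that is never sent to clients.
SECRET_KEY = b"replace-with-a-real-secret"


def trap_path(session_id):
    """Return the hidden trap URL path for this session.

    The path looks like an ordinary content link but differs per session,
    so a scraper cannot learn it once and skip it under another login.
    """
    digest = hmac.new(SECRET_KEY, session_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "/articles/" + digest[:16]


def is_trap_hit(session_id, requested_path):
    """Return True if the request targets this session's trap URL."""
    expected = trap_path(session_id).encode("utf-8")
    return hmac.compare_digest(expected, requested_path.encode("utf-8"))
```

The page template would render `trap_path(session_id)` as the invisible link, and the request handler would call `is_trap_hit` before normal routing. What happens on a hit (disable, notify, or something quieter) is a separate decision.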



Answer 7:

http://recaptcha.net

Show it either every time someone logs in or when they sign up. Maybe you could show a captcha on every tenth login.



Answer 8:

Added comments:

  • I know you can't completely protect something that a normal user should be able to see. I've been on both sides of the problem :)
  • From a developer's side, what do you think is the best ratio of time spent versus cases protected? I'd guess some simple user-agent checks would remove half or more of the potential crawlers, and I know you can spend months developing protection against the last 1%.

Again, from a service provider's point of view, I also want to make sure that one user (crawler) doesn't consume CPU/bandwidth needed by others, so are there any good bandwidth/request limiters you can point out?

Response to comment: Platform specifications: the application is based on JBoss Seam running on JBoss AS. However, there is an apache2 in front of it (running on Linux).



Answer 9:

Apache has some bandwidth-by-IP limiting modules AFAIK, and for my own largeish Java/JSP application with a lot of digital content I rolled my own servlet filter to do the same (and limit simultaneous connections from one IP block, etc).
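For illustration only (this is not the servlet filter mentioned above, and it is in Python rather than Java), limiting simultaneous connections per IP block could look roughly like this. The class name, the /24 block size, and the IPv4-only assumption are all choices made for the sketch.

```python
import ipaddress
import threading
from collections import Counter


class BlockConnectionLimiter:
    """Caps the number of simultaneous connections per IP block (IPv4 assumed)."""

    def __init__(self, max_concurrent, prefix_len=24):
        self.max_concurrent = max_concurrent
        self.prefix_len = prefix_len
        self.active = Counter()  # network block -> open connection count
        self.lock = threading.Lock()

    def _block(self, ip):
        # strict=False lets us pass a host address and get its enclosing network.
        return ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)

    def acquire(self, ip):
        """Register a new connection; return False if the block is at its cap."""
        block = self._block(ip)
        with self.lock:
            if self.active[block] >= self.max_concurrent:
                return False
            self.active[block] += 1
            return True

    def release(self, ip):
        """Mark a connection from this IP's block as finished."""
        block = self._block(ip)
        with self.lock:
            if self.active[block] > 0:
                self.active[block] -= 1
```

A filter would call `acquire` at the start of a request and `release` in a `finally` block, so that a failed request does not leave the counter stuck.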

I agree with the comments above that it's better to be subtle, so that malicious users cannot tell if or when they've tripped your alarms and thus don't know to take evasive action. In my case my server just seems to become slow, flaky, and unreliable (so no change there then)...
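A minimal sketch of the "just get slower" idea, assuming some separate mechanism maintains a per-user suspicion score (how that score is computed is not covered here). The function name, the doubling schedule, and the default values are illustrative assumptions.

```python
import random


def tarpit_delay(suspicion, base=0.5, cap=30.0, rng=random.random):
    """Return how many seconds to stall a response for a given suspicion score.

    The delay doubles with each suspicion level and is randomly jittered so
    the slowdown looks like ordinary flakiness rather than a fixed penalty.
    `rng` must return a float in [0, 1); it is injectable for testing.
    """
    if suspicion <= 0:
        return 0.0
    delay = min(cap, base * (2 ** (suspicion - 1)))
    # Scale by a factor between 0.5 and 1.5, but never exceed the cap.
    return min(cap, delay * (0.5 + rng()))
```

The caller would sleep for the returned number of seconds (or schedule the response later) before answering. Because nothing is ever refused outright, the scraper gets no clear signal about which request tripped the alarm.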

Rgds

Damon