How to protect/monitor your site from crawling by malicious users

Published 2019-03-17 07:13

Situation:

  • Site with content protected by username/password (not all accounts are controlled, since some may be trial/test users)
  • A normal search engine can't get at the content because of the username/password restrictions
  • A malicious user can still log in and pass the session cookie to a "wget -r" or something similar

The question is: what is the best way to monitor such activity and respond to it, given that the site policy is that no crawling/scraping is allowed?

I can think of some options:

  1. Set up some traffic-monitoring solution to limit the number of requests for a given user/IP.
  2. Related to the first point: automatically block some user-agents.
  3. (Evil :)) Set up a hidden link that, when accessed, logs the user out and disables their account. (Presumably a normal user would never follow it, since they wouldn't see it to click it, but a bot will crawl all links.) A rough sketch of this idea follows the list.
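
For reference, this is roughly what I have in mind for the hidden-link trap, written as a Java servlet filter. It is only a sketch: the trap path, the "username" session attribute, and the AccountService interface are made-up placeholders for whatever the application actually uses.

```java
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

// Placeholder for whatever user-management code the site already has.
interface AccountService {
    void disableAccount(String username);
}

public class CrawlerTrapFilter implements Filter {
    // Hypothetical trap path; link to it from every page in a way a human
    // never sees (e.g. display:none), so only a crawler will request it.
    private static final String TRAP_PATH = "/internal/do-not-follow";

    private final AccountService accounts;

    public CrawlerTrapFilter(AccountService accounts) {
        this.accounts = accounts;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        if (TRAP_PATH.equals(request.getServletPath())) {
            HttpSession session = request.getSession(false);
            if (session != null) {
                String username = (String) session.getAttribute("username");
                if (username != null) {
                    accounts.disableAccount(username);   // option 3: disable the account
                }
                session.invalidate();                    // ...and log the crawler out
            }
            response.sendError(HttpServletResponse.SC_FORBIDDEN);
            return;                                      // never serve real content here
        }
        chain.doFilter(req, res);
    }

    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}
}
```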

For point 1, do you know of a good, already-implemented solution? Any experience with one? One problem is that very active human users might show up as false positives.

For point 3: do you think this is really evil? Or do you see any possible problems with it?

Also accepting other suggestions.

9 answers
贼婆χ
#2 · 2019-03-17 07:23

Point 1 has the problem you mentioned yourself. It also doesn't help against a slower crawl of the site, and if you tighten the limits enough that it does, it may be even worse for legitimate heavy users.

You could turn point 2 around and only allow the user-agents you trust. Of course this won't help against a tool that fakes a standard user-agent.

A variation on point 3 would be to simply send a notification to the site owners; they can then decide what to do with that user.

Similarly, for my variation on point 2, you could make this a softer action and just send a notification that somebody is accessing the site with a weird user agent.
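
Something like the following is what I mean by the softer action, as a rough servlet-filter sketch. The trusted prefixes and the logging call are placeholders for whatever notification mechanism you actually want.

```java
import java.io.IOException;
import java.util.Set;
import java.util.logging.Logger;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

public class UserAgentWatchFilter implements Filter {
    private static final Logger LOG = Logger.getLogger("ua-watch");
    // Example whitelist only; real browser UA strings vary a lot.
    private static final Set<String> TRUSTED_PREFIXES = Set.of("Mozilla/", "Opera/");

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String ua = ((HttpServletRequest) req).getHeader("User-Agent");
        boolean trusted = ua != null
                && TRUSTED_PREFIXES.stream().anyMatch(ua::startsWith);
        if (!trusted) {
            // Soft action: record it (or email the site owner) instead of blocking,
            // since user-agent strings are trivially faked.
            LOG.warning("Unusual user agent from " + req.getRemoteAddr() + ": " + ua);
        }
        chain.doFilter(req, res);
    }

    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}
}
```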

Edit: Relatedly, I once had a weird issue when I was accessing a URL of my own that was not public (I was just staging a site that I hadn't announced or linked anywhere). Although nobody but me should even have known this URL, all of a sudden I noticed hits in the logs. When I tracked it down, I saw they came from some content-filtering site. It turned out that my mobile ISP used a third party to block content, and it intercepted my own requests; since it didn't know the site, it fetched the page I was trying to access and (I assume) did some keyword analysis to decide whether or not to block it. This kind of edge case might be something you need to watch out for.

smile是对你的礼貌
#3 · 2019-03-17 07:24

Short answer: it can't be done reliably.

You can go a long way by simply blocking IP addresses that cause more than a certain number of hits in some time frame (some web servers support this out of the box, others require a module, or you can do it by parsing your log file and, for example, adding iptables rules), but you need to take care not to block the major search engine crawlers and large ISPs' proxies.
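
If you go the log-parsing route, the core of it is only a few lines. Here is a sketch that counts hits per IP in an Apache-style access log; the log path, threshold, and whitelist are made up, and it prints the iptables command instead of running it so a human can review first.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class LogRateCheck {
    private static final int THRESHOLD = 1000;          // hits per log window (example value)
    private static final Set<String> WHITELIST =        // known crawlers, big proxies, yourself
            Set.of("127.0.0.1");

    public static void main(String[] args) throws IOException {
        Path accessLog = Paths.get(args.length > 0 ? args[0] : "/var/log/apache2/access.log");
        Map<String, Integer> hitsPerIp = new HashMap<>();

        // In the Apache combined log format the client IP is the first field.
        for (String line : Files.readAllLines(accessLog)) {
            String ip = line.split(" ", 2)[0];
            hitsPerIp.merge(ip, 1, Integer::sum);
        }

        for (Map.Entry<String, Integer> e : hitsPerIp.entrySet()) {
            if (e.getValue() > THRESHOLD && !WHITELIST.contains(e.getKey())) {
                // Print the command rather than executing it; false positives
                // (active humans, shared proxies) are the big risk here.
                System.out.println("iptables -A INPUT -s " + e.getKey() + " -j DROP");
            }
        }
    }
}
```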

在下西门庆
#4 · 2019-03-17 07:26

http://recaptcha.net

Show it either every time someone logs in or during sign-up. Or maybe you could show a captcha only every tenth login.
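
Roughly like this for the every-tenth-login variant. The UserStore and CaptchaVerifier interfaces are stand-ins for your own user table and whatever recaptcha.net client library you use; only the counting logic is the point here.

```java
public class LoginService {
    private final UserStore users;            // hypothetical persistence layer
    private final CaptchaVerifier captcha;    // hypothetical wrapper around recaptcha.net

    public LoginService(UserStore users, CaptchaVerifier captcha) {
        this.users = users;
        this.captcha = captcha;
    }

    public boolean login(String username, String password, String captchaResponse) {
        if (!users.checkPassword(username, password)) {
            return false;
        }
        long count = users.incrementLoginCount(username);
        // Every tenth login, refuse to proceed unless the captcha checks out.
        if (count % 10 == 0 && !captcha.verify(captchaResponse)) {
            return false;
        }
        return true;
    }
}

interface UserStore {
    boolean checkPassword(String username, String password);
    long incrementLoginCount(String username);
}

interface CaptchaVerifier {
    boolean verify(String captchaResponse);
}
```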

爷、活的狠高调
#5 · 2019-03-17 07:30

Apache has some bandwidth-by-IP limiting modules, AFAIK, and for my own largish Java/JSP application with a lot of digital content I rolled my own servlet filter to do the same (and to limit simultaneous connections from one IP block, etc.).
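
A rough sketch of that kind of filter is below (not the actual code from that application; the limit and the 503 response are arbitrary choices for illustration).

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;

public class ConcurrencyLimitFilter implements Filter {
    private static final int MAX_CONCURRENT = 4;   // arbitrary example limit per IP
    // Note: a real filter should also prune idle entries from this map.
    private final ConcurrentHashMap<String, AtomicInteger> active = new ConcurrentHashMap<>();

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String ip = req.getRemoteAddr();
        AtomicInteger counter = active.computeIfAbsent(ip, k -> new AtomicInteger());
        if (counter.incrementAndGet() > MAX_CONCURRENT) {
            counter.decrementAndGet();
            // A 503 (or an artificial delay) keeps things "slow and flaky"
            // rather than obviously blocked.
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        try {
            chain.doFilter(req, res);
        } finally {
            counter.decrementAndGet();
        }
    }

    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}
}
```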

I agree with the comments above that it's better to be subtle, so that a malicious user cannot tell if or when they've tripped your alarms and thus doesn't know to take evasive action. In my case the server just seems to become slow, flaky and unreliable (so no change there, then)...

Rgds

Damon

时光不老,我们不散
#6 · 2019-03-17 07:36

It depends on what kind of malicious user we are talking about.

If they know how to use wget, they can probably set up Tor and get a new IP address every time, slowly copying everything you have. I don't think you can prevent that without inconveniencing your (paying?) users.

It is the same as DRM on games, music, and video: if the end user is supposed to see something, you cannot protect it.

狗以群分
#7 · 2019-03-17 07:37

@frankodwyer:

  • Allowing only trusted user agents won't work; consider especially the IE user-agent string, which gets modified by add-ons or the installed .NET version. There would be too many possibilities, and it can be faked anyway.
  • The variation on point 3 with a notification to the admins would probably work, but it would mean an indeterminate delay if an admin isn't monitoring the logs constantly.

@Greg Hewgill:

  • The auto-logout would also disable the user account. At the very least, a new account would have to be created, leaving more of a trail (email address and other information).

Randomly changing the logout/disable URL for point 3 would be interesting, but I don't know how I would implement it yet :)
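
One rough, untested idea (all names made up): generate a random trap token per session, render it into a hidden link on every page, and have the request filter compare the requested path against it before disabling the account.

```java
import java.security.SecureRandom;
import java.util.Base64;
import javax.servlet.http.HttpSession;

public final class TrapLink {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final String ATTR = "trapToken";

    // Call from the page template; render the result inside an invisible
    // <a href> that a human never sees but a crawler will follow.
    public static String trapUrlFor(HttpSession session) {
        String token = (String) session.getAttribute(ATTR);
        if (token == null) {
            byte[] bytes = new byte[16];
            RANDOM.nextBytes(bytes);
            token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
            session.setAttribute(ATTR, token);
        }
        return "/t/" + token;
    }

    // The request filter calls this; a hit on the trap URL means "disable the
    // account and invalidate the session", as in the sketch in the question.
    public static boolean isTrapHit(HttpSession session, String requestPath) {
        String token = session == null ? null : (String) session.getAttribute(ATTR);
        return token != null && requestPath.equals("/t/" + token);
    }
}
```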
