Some other websites use cURL and a fake HTTP referer to copy my website's content. Is there any way to detect cURL, or requests that do not come from a real web browser?
Remember: HTTP is not magic. There's a defined set of headers sent with each HTTP request; if those headers can be sent by a web browser, they can just as well be sent by any program, including cURL (and libcurl).
Some consider it a curse, but on the other hand, it's a blessing, as it greatly simplifies functional testing of web applications.
UPDATE: As unr3al011 rightly noticed, curl doesn't execute JavaScript, so in theory it's possible to create a page that behaves differently when viewed by grabbers (for example, by setting, and later checking, a specific cookie via JavaScript).
Still, it would be a very fragile defense. The page's data still has to be fetched from the server, and that HTTP request (and it's always an HTTP request) can be emulated by curl. Check this answer for an example of how to defeat such a defense.
... and I didn't even mention that some grabbers are able to execute JavaScript. )
Put this into your root folder as a .htaccess file; it may help. I found it on a webhosting provider's site, but I don't know exactly what it means :)
You can detect the cURL user agent by the following method. But be warned: the user agent can be overwritten by the user; still, the default setting can be recognized by:
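The answer's original snippet is not shown here; the following is a minimal sketch of that check, reading $_SERVER['HTTP_USER_AGENT'] and looking for the default curl/x.y.z string (the 403 response is my own choice):

```php
<?php
// Reject requests whose User-Agent identifies cURL. Note that the header is
// trivially overridden (curl -A "..."), so this only catches clients that
// keep the default "curl/x.y.z" string or send no User-Agent at all.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if ($userAgent === '' || stripos($userAgent, 'curl') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}
```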
There is no magic solution to prevent automatic crawling. Everything a human can do, a robot can do too. There are only ways to make the job harder, so hard that only highly skilled geeks will bother trying to get past them.
I was in trouble too some years ago, and my first piece of advice is: if you have time, be a crawler yourself (I assume a "crawler" is the person who crawls your website); this is the best school for the subject. By crawling several websites, I learned different kinds of protections, and by combining them I became effective.
Here are some examples of protections you may try.
Sessions per IP
If a user opens 50 new sessions each minute, you can suspect this user is a crawler that does not handle cookies. Of course, curl manages cookies perfectly, but if you couple this with a visit counter per session (explained later), or if your crawler is a newbie with cookies, it may be effective.
It is difficult to imagine that 50 people on the same shared connection will hit your website simultaneously (it of course depends on your traffic; that is up to you). If that happens, you can lock pages of your website until a captcha is solved.
Idea :
1) create 2 tables: one to store banned IPs and one to store IPs and sessions
2) at the beginning of your script, delete entries that are too old from both tables
3) next, check whether your user's IP is banned (if so, set a flag to true)
4) if not, count how many sessions exist for his IP
5) if he has too many sessions, insert his IP into the banned table and set the flag
6) insert his IP into the sessions-per-IP table if it has not already been inserted
I wrote a code sample to better illustrate the idea.
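The original sample is not reproduced in the post; here is a hedged sketch of the six steps above, assuming a MySQL database with two hypothetical tables, banned_ips(ip, banned_at) and ip_sessions(ip, session_id, created_at), and arbitrary thresholds:

```php
<?php
// Sessions-per-IP check: ban IPs that open too many sessions in a short window.
session_start();

$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');
$ip  = $_SERVER['REMOTE_ADDR'];

$maxSessionsPerIp = 50;    // sessions allowed per window before banning
$window           = 60;    // seconds
$banDuration      = 3600;  // seconds

// 2) delete entries that are too old from both tables
$pdo->prepare('DELETE FROM ip_sessions WHERE created_at < ?')->execute([time() - $window]);
$pdo->prepare('DELETE FROM banned_ips WHERE banned_at < ?')->execute([time() - $banDuration]);

// 3) check whether the IP is already banned
$stmt = $pdo->prepare('SELECT COUNT(*) FROM banned_ips WHERE ip = ?');
$stmt->execute([$ip]);
$banned = $stmt->fetchColumn() > 0;

if (!$banned) {
    // 4) count how many sessions this IP has opened inside the window
    $stmt = $pdo->prepare('SELECT COUNT(DISTINCT session_id) FROM ip_sessions WHERE ip = ?');
    $stmt->execute([$ip]);

    if ($stmt->fetchColumn() >= $maxSessionsPerIp) {
        // 5) too many sessions: ban the IP
        $pdo->prepare('INSERT INTO banned_ips (ip, banned_at) VALUES (?, ?)')->execute([$ip, time()]);
        $banned = true;
    } else {
        // 6) record the current session for this IP (MySQL ignores duplicates)
        $pdo->prepare('INSERT IGNORE INTO ip_sessions (ip, session_id, created_at) VALUES (?, ?, ?)')
            ->execute([$ip, session_id(), time()]);
    }
}

if ($banned) {
    // lock the page until a captcha is solved (or simply refuse)
    header('HTTP/1.1 403 Forbidden');
    exit('Too many sessions from your IP - please solve the captcha.');
}
```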
Visit Counter
If your user uses the same cookie to crawl your pages, you will be able to use his session to block him. The idea is quite simple: is it plausible that a real user visits 60 pages in 60 seconds?
Idea :
1) on each page, count the pages visited by the current session within a short time window
2) if the count exceeds what a human could realistically do, block the session or ask for a captcha
Sample code :
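The original sample is not shown; a minimal sketch of such a per-session counter, with an assumed limit of 60 pages per 60 seconds, could look like this:

```php
<?php
// Visit counter: count page views per session inside a sliding one-minute window.
session_start();

$limit  = 60;   // more than 60 pages ...
$window = 60;   // ... in 60 seconds looks automated
$now    = time();

if (!isset($_SESSION['visit_window_start']) || $now - $_SESSION['visit_window_start'] > $window) {
    // start a new counting window
    $_SESSION['visit_window_start'] = $now;
    $_SESSION['visit_count'] = 0;
}

$_SESSION['visit_count']++;

if ($_SESSION['visit_count'] > $limit) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('Please slow down, or solve the captcha to continue.');
}
```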
An image to download
When a crawler needs to do its dirty work, it is for a large amount of data, in the shortest possible time. That is why crawlers don't download the images on your pages; it takes too much bandwidth and slows the crawling down.
This idea (I think the most elegant and the easiest to implement) uses mod_rewrite to hide code behind a .jpg/.png/… image URL. This image should be present on each page you want to protect: it could be your website logo, but choose a small image (because this image must not be cached).
Idea :
1/ Add these lines to your .htaccess
2/ Create your logo.php with the security check
3/ Increment your no_logo_count on each page you need to protect, and check whether it has reached your limit.
Sample code :
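The original sample is not shown; here is a hedged sketch of the three steps, with assumed file names (images/logo.png, logo.php) and an assumed limit of 5 pages served without a logo request:

```apache
# 1/ .htaccess - silently serve the "image" through PHP
RewriteEngine On
RewriteRule ^images/logo\.png$ logo.php [L]
```

```php
<?php
// 2/ logo.php - output the real image and reset the counter
session_start();
$_SESSION['no_logo_count'] = 0;

header('Content-Type: image/png');
header('Cache-Control: no-store, no-cache, must-revalidate'); // the image must not be cached
readfile(__DIR__ . '/images/real_logo.png');
exit;
```

```php
<?php
// 3/ on each protected page - increment the counter and check the limit
session_start();

$_SESSION['no_logo_count'] = isset($_SESSION['no_logo_count']) ? $_SESSION['no_logo_count'] + 1 : 1;

if ($_SESSION['no_logo_count'] > 5) {
    // several pages served and the logo was never fetched: probably a crawler
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}
```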
Cookie check
You can create cookies on the JavaScript side to check whether your users interpret JavaScript (a crawler using cURL does not, for example).
The idea is quite simple: it is about the same as the image check.
Code :
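The original code is not shown; a minimal sketch of the idea, with an assumed cookie name (js_check) and an assumed threshold, might look like this:

```php
<?php
// Cookie check: JavaScript sets a cookie, the server verifies it on later requests.
session_start();

$_SESSION['no_js_count'] = isset($_SESSION['no_js_count']) ? $_SESSION['no_js_count'] + 1 : 1;

if (isset($_COOKIE['js_check']) && $_COOKIE['js_check'] === 'ok') {
    // the client executed our JavaScript at least once
    $_SESSION['no_js_count'] = 0;
}

if ($_SESSION['no_js_count'] > 5) {
    // several pages served and the cookie never appeared: likely curl and friends
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}
?>
<!-- emitted on every page: the JavaScript that sets the cookie -->
<script>document.cookie = "js_check=ok; path=/";</script>
```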
Protection against proxies
A few words about the different kinds of proxies you may encounter on the web: roughly, transparent proxies (which pass along the client's real IP), anonymous proxies (which hide the IP but still identify themselves as proxies), and high-anonymous proxies (which hide both).
It is easy to find a proxy to connect to any website, but it is very hard to find high-anonymous proxies.
Some $_SERVER variables may contain keys that are set specifically when your user is behind a proxy (exhaustive list taken from this question):
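The referenced list is not reproduced in the post; the keys below are the ones commonly cited for this purpose and are shown as a best-effort reconstruction rather than a guaranteed copy of the original:

```php
<?php
// $_SERVER keys that are often present when the client is behind a proxy.
$proxyHeaders = [
    'HTTP_VIA', 'HTTP_X_FORWARDED_FOR', 'HTTP_FORWARDED_FOR', 'HTTP_X_FORWARDED',
    'HTTP_FORWARDED', 'HTTP_CLIENT_IP', 'HTTP_FORWARDED_FOR_IP', 'VIA',
    'X_FORWARDED_FOR', 'FORWARDED_FOR', 'X_FORWARDED', 'FORWARDED',
    'CLIENT_IP', 'FORWARDED_FOR_IP', 'HTTP_PROXY_CONNECTION',
];

$behindProxy = false;
foreach ($proxyHeaders as $key) {
    if (isset($_SERVER[$key])) {
        $behindProxy = true;
        break;
    }
}
```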
You may apply a different behavior (lower limits, etc.) in your anti-crawl protections if you detect one of those keys in your $_SERVER variable.
Conclusion
There are many ways to detect abuse on your website, so you will find a solution for sure. But you need to know precisely how your website is used, so that your protections are not aggressive toward your "normal" users.
The way to avoid fake referers is to track the user.
You can track the user with one or more of these methods:
Save a cookie in the client browser containing some special code (e.g. the last URL visited, a timestamp) and verify it on each request to your server.
Same as before, but using sessions instead of explicit cookies.
For cookies, you should add cryptographic protection, for example storing the value together with a keyed hash of it.
The hash is calculated in PHP like this:
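The original snippet is not shown; a common way to do this, sketched here with hypothetical cookie and secret names, is an HMAC over the cookie value:

```php
<?php
// Sign the cookie value with a server-side secret so the client cannot forge it.
// The cookie stores "value|hash".
$secret = 'change-me-server-side-secret';

// when setting the cookie
$value = 'last_url=/products;ts=' . time();
$hash  = hash_hmac('sha256', $value, $secret);
setcookie('tracker', $value . '|' . $hash, time() + 3600, '/');

// when reading it back on a later request
if (isset($_COOKIE['tracker'])) {
    $parts = explode('|', $_COOKIE['tracker'], 2);
    if (count($parts) !== 2 || !hash_equals(hash_hmac('sha256', $parts[0], $secret), $parts[1])) {
        // missing or tampered cookie: treat the request as untracked
        unset($_COOKIE['tracker']);
    }
}
```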
As some have mentioned, cURL cannot execute JavaScript (to my knowledge), so you could possibly try setting something up as raina77ow suggests, but that would not work for other grabbers/downloaders.
I suggest you try building a bot trap; that way you also deal with the grabbers/downloaders that can execute JavaScript.
I don't know of any single solution that fully prevents this, so my best recommendation is to try multiple solutions:
1) Only allow known user agents, such as all mainstream browsers, in your .htaccess file
2) Set up your robots.txt to keep well-behaved bots out
3) Set up a bot trap for bots that do not respect the robots.txt file (a sketch follows below)
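As a hedged sketch of such a bot trap (file names and storage are assumptions): link an invisible page from your templates, disallow it in robots.txt, and ban every IP that requests it anyway.

```php
<?php
// trap.php - linked invisibly from your pages, e.g.
//   <a href="/trap.php" style="display:none" rel="nofollow"></a>
// and disallowed in robots.txt ("Disallow: /trap.php"), so polite bots and
// human visitors never reach it. Anything that requests it gets banned.
$ip = $_SERVER['REMOTE_ADDR'];

// append the offending IP to a ban list (a database table works just as well)
file_put_contents(__DIR__ . '/banned_ips.txt', $ip . PHP_EOL, FILE_APPEND | LOCK_EX);

header('HTTP/1.1 403 Forbidden');
exit('Forbidden');
```

```php
<?php
// on every normal page: refuse clients whose IP is on the ban list
$banFile = __DIR__ . '/banned_ips.txt';
$banned  = is_file($banFile) ? file($banFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) : [];

if (in_array($_SERVER['REMOTE_ADDR'], $banned, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}
```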