I've come across a question where a user is having difficulty accessing an image through a script (using cURL / file_get_contents()):
How to save an image from url using PHP?
The image link seems to return a 403 error when requested with file_get_contents(). With cURL, however, a more detailed error is returned:
You were denied access to the system. Turn off the engine or Surf
Proxy, Fake IP if you really want to access. Proxy or not accepted
from any Web tools Intrusion Prevention System.
Binh Minh Online Data Services @ 2008 - 2012
I also failed to access the same image after fiddling around with a cURL request myself. I tried changing the user-agent to my browser's exact user-agent string, which can successfully access the image. I've also tried the script on my personal local server, which (obviously) uses the same IP address as my browser... So as far as I can tell, user-agents and IP addresses are ruled out.
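For reference, one of my attempts looked roughly like this (just a sketch; the exact headers varied, and the user-agent string below is a placeholder for my browser's real one):

<?php
// Sketch of one attempt: a plain cURL request with my browser's user-agent.
$url = 'http://phim.xixam.com/thumb/giotdang.jpeg';

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 ...'); // placeholder: my browser's exact UA string
$data   = curl_exec($curl);
$status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);

echo $status; // still get the "denied access" page instead of the image
?>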
How else can someone detect a script performing a request?
BTW, this is not for anything crazy. I'm just curious xD
It is indeed a cookie that is set by JavaScript, followed by a redirect to the original image. The problem is that cURL/file_get_contents() won't parse the HTML and run the JavaScript that sets the cookie; only cookies set by the server in its response headers end up in cURL's cookie jar.
This is the page you get before the redirect; it creates a cookie via JavaScript with no name and location.href as the value:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<HEAD>
<TITLE>http://phim.xixam.com/thumb/giotdang.jpeg</TITLE>
<meta http-equiv="Refresh" content="0;url=http://phim.xixam.com/thumb/giotdang.jpeg">
</HEAD>
<script type="text/javascript">
window.onload = function checknow() {
var today = new Date();
var expires = 3600000*1*1;
var expires_date = new Date(today.getTime() + (expires));
var ua = navigator.userAgent.toLowerCase();
if ( ua.indexOf( "safari" ) != -1 ) { document.cookie = "location.href"; } else { document.cookie = "location.href;expires=" + expires_date.toGMTString(); }
}
</script>
<BODY>
</BODY></HTML>
But all is not lost, because by pre-setting/forging the cookie you can circumvent this security measure (a reason why using cookies for any kind of security is bad).
cookie.txt
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
phim.xixam.com FALSE /thumb/ FALSE 1338867990 location.href
So the finished cURL script would look something like this:
<?php
function curl_get($url){
    if (!function_exists('curl_init')) {
        die('cURL must be installed!');
    }

    // Forge the cookie that the JavaScript would normally set.
    // Note: the fields of a Netscape cookie file are tab-separated, and the heredoc
    // body/closer must start at column 0 so no extra whitespace ends up in the file.
    $expire = time() + 3600; // one hour from now (the JS uses 3600000, but that is milliseconds)
    $cookie = <<<COOKIE
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
phim.xixam.com	FALSE	/thumb/	FALSE	$expire	location.href
COOKIE;
    file_put_contents(dirname(__FILE__).'/cookie.txt', $cookie);

    // Browser-masquerading cURL request
    $curl = curl_init();
    $header = array();
    $header[0]  = "Accept: text/xml,application/xml,application/json,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[]   = "Cache-Control: max-age=0";
    $header[]   = "Connection: keep-alive";
    $header[]   = "Keep-Alive: 300";
    $header[]   = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[]   = "Accept-Language: en-us,en;q=0.5";
    $header[]   = "Pragma: ";

    curl_setopt($curl, CURLOPT_COOKIEJAR, dirname(__FILE__).'/cookie.txt');
    curl_setopt($curl, CURLOPT_COOKIEFILE, dirname(__FILE__).'/cookie.txt');
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    // Pass the referer check
    curl_setopt($curl, CURLOPT_REFERER, 'http://xixam.com/forum.php');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

    $data = curl_exec($curl);
    curl_close($curl);

    return $data;
}

$image = curl_get('http://phim.xixam.com/thumb/giotdang.jpeg');
file_put_contents('test.jpg', $image);
?>
The only realistic way to stop a crawler is to log every visitor's IP in your database and increment a counter per IP on each visit. Once a week or so, look at the top hits by IP, do a reverse lookup on each, and if it resolves to a hosting provider, block it at your firewall or in .htaccess. Other than that, you can't really stop requests to a publicly available resource, since any hurdle can be overcome.
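If you go that route, a minimal sketch of the logging part might look like this (the table name, columns, and PDO handle are made up for illustration; adapt them to your own schema):

<?php
// Sketch only: assumes a MySQL table `visitor_hits` (ip VARCHAR PRIMARY KEY, hits INT)
// and an existing PDO connection in $db.
$ip = $_SERVER['REMOTE_ADDR'];

$stmt = $db->prepare(
    'INSERT INTO visitor_hits (ip, hits) VALUES (:ip, 1)
     ON DUPLICATE KEY UPDATE hits = hits + 1'
);
$stmt->execute(array(':ip' => $ip));

// In the weekly review: SELECT ip, hits FROM visitor_hits ORDER BY hits DESC LIMIT 20;
// then $host = gethostbyaddr($ip); and if the hostname belongs to a hosting provider,
// block that IP in your firewall or .htaccess.
?>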
Hope it helps.