I am trying to scrape a website and I am getting a 403 Forbidden error no matter what I try:
- wget
- cURL (command line and PHP)
- Perl WWW::Mechanize
- PhantomJS
I tried all of the above with and without proxies, changing user-agent, and adding a referrer header.
I even copied the request headers from my Chrome browser and sent them with my request using PHP cURL, and I am still getting a 403 Forbidden error.
Any input or suggestions on what is triggering the website to block the request, and how to get around it?
PHP cURL example:

$url = 'https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=1510475982858';
$headers = array(
    'accept:application/json, text/javascript, */*; q=0.01',
    'accept-encoding:gzip, deflate, br',
    'accept-language:en-US,en;q=0.9',
    'referer:https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands:quadblock:supplements',
    'user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'x-requested-with:XMLHttpRequest',
);
$res = curl_get($url, $headers);
print $res;
exit;

function curl_get($url, $headers = array(), $useragent = '') {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);          // include response headers in the output
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // testing only: disables certificate verification
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_ENCODING, '');          // accept any encoding curl supports
    if ($useragent) curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
    if ($headers) curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    $response = curl_exec($curl);
    // split the raw response into headers and body, return only the body
    $header_size = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
    $header = substr($response, 0, $header_size);
    $response = substr($response, $header_size);
    curl_close($curl);
    return $response;
}
And here is the response I always get:
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access
"http://www.vitacost.com/productResults.aspx?"
on this server.<P>
Reference #18.55f50717.1510477424.2a24bbad
</BODY>
</HTML>
First, note that the site does not like web scraping. As @KeepCalmAndCarryOn pointed out in a comment, the site has a /robots.txt where it explicitly asks bots not to crawl specific parts of the site, including the parts you want to scrape. While not legally binding, a good citizen will adhere to such a request.
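You can check such rules programmatically before scraping. A minimal sketch in Python using the standard library's robots.txt parser (the Disallow lines below are hypothetical placeholders, not the site's actual rules — fetch https://www.vitacost.com/robots.txt yourself to see the real ones):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for illustration.
robots_lines = [
    "User-agent: *",
    "Disallow: /productResults.aspx",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# can_fetch() reports whether the given user agent may request a URL.
print(rp.can_fetch("*", "https://www.vitacost.com/productResults.aspx?allCategories=true"))  # False
print(rp.can_fetch("*", "https://www.vitacost.com/"))  # True
```

In real use you would load the live file with `rp.set_url(...)` and `rp.read()` instead of hard-coded lines.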
Additionally, the site seems to employ explicit protection against scraping and tries to make sure that the client is really a browser. The site appears to sit behind the Akamai CDN, so the anti-scraping protection may come from the CDN itself.
But I took the request sent by Firefox (which worked) and then tried to simplify it as much as possible. The following currently works for me, but might of course fail if the site updates its browser detection:
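The answerer's exact snippet is not reproduced here; as a hedged reconstruction only, a stripped-down request consistent with the description below (just the Accept and Accept-Language headers, no User-Agent) might be sketched in Python like this:

```python
import urllib.request

# URL taken from the question; header values mirror the question's headers.
url = ("https://www.vitacost.com/productResults.aspx"
       "?allCategories=true&N=1318723&scrolling=true&No=40")

req = urllib.request.Request(url, headers={
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.9",
})

# Actually sending the request is left commented out: whether it succeeds
# also depends on the source IP (see the EDIT below), so no particular
# response is guaranteed.
# body = urllib.request.urlopen(req).read()
```

This is an illustration of the minimal header set described, not a verified working bypass.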
Interestingly, if I remove the Accept header I get a 403 Forbidden. If I instead remove the Accept-Language header, the request simply hangs. Also interestingly, it does not seem to need a User-Agent header at all.

EDIT: it looks like the bot detection also uses the source IP of the sender as a feature. While the code above works for me from two different systems, it fails from a third system (hosted at DigitalOcean) and just hangs.