I am trying to scrape a website and I am getting a 403 Forbidden error no matter what I try:
- wget
- cURL (command line and PHP)
- Perl WWW::Mechanize
- PhantomJS
I tried all of the above with and without proxies, changing user-agent, and adding a referrer header.
I even copied the request headers from my Chrome browser and sent them with my request using PHP cURL, and I am still getting a 403 Forbidden error.
Any input or suggestions on what is triggering the website to block the request, and how to get around it?
PHP cURL example:

$url = 'https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=1510475982858';
$headers = array(
    'accept:application/json, text/javascript, */*; q=0.01',
    'accept-encoding:gzip, deflate, br',
    'accept-language:en-US,en;q=0.9',
    'referer:https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands:quadblock:supplements',
    'user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'x-requested-with:XMLHttpRequest',
);
$res = curl_get($url, $headers);
print $res;
exit;

function curl_get($url, $headers = array(), $useragent = '') {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);          // include response headers in the output
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // testing only: disables certificate verification
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_ENCODING, '');          // accept any encoding curl supports
    if ($useragent) curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
    if ($headers) curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    $response = curl_exec($curl);
    // split the raw response into headers and body, return only the body
    $header_size = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
    $header = substr($response, 0, $header_size);
    $response = substr($response, $header_size);
    curl_close($curl);
    return $response;
}
And here is the response I always get:
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access
"http://www.vitacost.com/productResults.aspx?"
on this server.<P>
Reference #18.55f50717.1510477424.2a24bbad
</BODY>
</HTML>
First, note that the site does not like web scraping. As @KeepCalmAndCarryOn pointed out in a comment, the site has a /robots.txt where it explicitly asks bots not to crawl specific parts of the site, including the parts you want to scrape. While not legally binding, a good citizen will adhere to such a request.
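You can check such rules programmatically before scraping. A minimal sketch in Python using the standard library's robots.txt parser (the Disallow lines below are hypothetical placeholders, not the site's actual rules — fetch https://www.vitacost.com/robots.txt yourself to see the real ones):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for illustration.
robots_lines = [
    "User-agent: *",
    "Disallow: /productResults.aspx",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# can_fetch() reports whether the given user agent may request a URL.
print(rp.can_fetch("*", "https://www.vitacost.com/productResults.aspx?allCategories=true"))  # False
print(rp.can_fetch("*", "https://www.vitacost.com/"))  # True
```

In real use you would load the live file with `rp.set_url(...)` and `rp.read()` instead of hard-coded lines.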
Additionally, the site seems to employ explicit protection against scraping and tries to make sure that the client is really a browser. The site appears to sit behind the Akamai CDN, so the anti-scraping protection may come from the CDN itself.
But I took the request sent by Firefox (which worked) and then tried to simplify it as much as possible. The following currently works for me, but might of course fail if the site updates its browser detection:
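The answerer's exact snippet is not reproduced here; as a hedged reconstruction only, a stripped-down request consistent with the description below (just the Accept and Accept-Language headers, no User-Agent) might be sketched in Python like this:

```python
import urllib.request

# URL taken from the question; header values mirror the question's headers.
url = ("https://www.vitacost.com/productResults.aspx"
       "?allCategories=true&N=1318723&scrolling=true&No=40")

req = urllib.request.Request(url, headers={
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.9",
})

# Actually sending the request is left commented out: whether it succeeds
# also depends on the source IP (see the EDIT below), so no particular
# response is guaranteed.
# body = urllib.request.urlopen(req).read()
```

This is an illustration of the minimal header set described, not a verified working bypass.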
Interestingly, if I remove the Accept header I get a 403 Forbidden. If I instead remove the Accept-Language header, the request simply hangs. Also interestingly, it does not seem to need a User-Agent header at all.

EDIT: it looks like the bot detection also uses the source IP of the sender as a feature. While the code above works for me from two different systems, it fails from a third system (hosted at DigitalOcean) and just hangs.