Scrap amazon all deals php curl?

2020-03-30 19:50发布

问题:

I want to scrap amazon all deals page

http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1

So i am using curl php

$request = 'http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1';
        $ch = curl_init();
        curl_setopt($ch,CURLOPT_URL,$request);
        curl_setopt($ch, CURLOPT_HEADER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 80);
        $file_source = curl_exec($ch);
        print_r($file_source);
        exit;

scrapping completed but response page content div empty. contents all came from dynamic ajax requests in amazon. how can i scrap the all deal products using php and curl

My response image link

Update Code

 $request = 'http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1';

        $header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        /*$header[] = "Accept-Language: en-US,en;q=0.5";*/
    /*  $header[] = "Accept-Encoding: gzip, deflate";*/
        $header[] = 'Cookie: x-wl-uid=1vlKm5hBxhHPg37UgkrAPYZZaV0wv+T5knGezWJq0AIEWI30hJYp0XouddMIZeemj1LKAi9fDQq7aoFN+mbvlVYPTBQVLFdzs0aeTGWtiCY0Ay63L0ezPfZRKXQHC
/Wum4ywRviFW9es=; session-id-time=2082787201l; session-id=192-9168386-7231424; ubid-main=187-6710460-8617661
; session-token="+SFC4vDx/BvcD8D1Mdgeo2jtnTD0qPHF5j2nWNwbFGcRyW7/o4LBOmBHJosU5W0SgoAd6lhi0NZWg/6o5WE6o45k
+VCT5a5dgj0tltSEkBT80oWT0CDk+jCDEEhIcxnCe6aqkUn6soFiMJHIsMWujo4qyA6A70PC1xKGKdIFMUm3H0DGSdIMqITs4Mjb1
/1vY6GxnPeh5ncasxl+tUN2dHVwwJbj1ZrmyJdDxSDd8/o="; __utma=194891197.2101747155.1434117141.1434356635.1434362529
.4; __utmz=194891197.1434362529.4.4.utmccn=(referral)|utmcsr=stackoverflow.com|utmcct=/questions/11589556
/retrieving-an-amazon-stores-list-of-products-using-php|utmcmd=referral; x-main="Xi0312Ip8BrjoFoj6Zp9OLxDcU6kCvlm4DExlT5yNgHa2b3htenxvUsF2TZR3
?Fn"; s_pers=%20s_vnum%3D1866356399079%2526vn%253D2%7C1866356399079%3B%20s_invisit%3Dtrue%7C1434364356330
%3B%20s_nr%3D1434362556331-Repeat%7C1442138556331%3B; csm-hit=b-1RHERWP84F8S70KRQ903|1434453087266; preferred-geo
=national; UserPref=O9NYa0FpfOIAcRMnkQf7WL3LyhrjCsMBKgKfVxT4zK8uOTF5KjzPAwmz0DuVnfXhdkinEE1BEMgPn09eHwavl
+Hwl1BOSvjp1ewiG1iCXa0R77FsPOGbpq06MWB0MC7Wwff4gehUEAle5IfyFQqKGh1XvJ4YiMFsR2mwmyzzVJTo0WPGZzvvpCVLFmx22cRVwEi4sX8y
+IfEKu76B4p1GHPdZVo1HIwLooo8CT7lboNUi4Hhn6mhtyGCNEDLvWD8NII48Vd9EkcBjUpiSeNroRjYO9yNkj8SI3xJVI0befNipOfxAzPSnuQqeBpqm99bWArk9ZZl
+EM5QKzoPNJSF0FqVnnYavt4G6F/PHedaJVl8pU0A6N9lBjK6YZRFflyaoEYPtUW+nqK0xqO+YusAMAlhHBuW33KMdtt3i6oufQ4yTDqIgAiQ1ZTXcsb2tcu
; s_dslv=1434370132739; lc-main=en_US; aws-target-visitor-id=1434357190046-572838.22_02; aws-target-data
=%7B%22support%22%3A%221%22%7D; s_fid=7BB6DD9CE8128EC3-2A07290402DD6AF6; s_vn=1465893191447%26vn%3D1
; s_nr=1434370132733-New; s_vnum=1866370132735%26vn%3D1; skin=noskin; b2b-main=0';
        $header[] = "Connection: keep-alive";
        $reffer = 'http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1';
        $ch = curl_init();
        curl_setopt($ch,CURLOPT_URL,$request);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Firefox/38.0');
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header); 
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_REFERER, $reffer);
        curl_setopt($ch, CURLOPT_TIMEOUT, 80);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 10);        
        $file_source = curl_exec($ch);

        print_r($file_source);

回答1:

Based on my quick reseach you might query XHRs made by amazon to request deals.

Such dynamic websites get their data thru Ajax JSON calls. One might try to find out where from the data is dynamically downloaded, (using dev. tools or web sniffer), and then query those urls for data.

See the shot. But if you to query them with php Curl you should use/imitate the http headers of that particular request headers (including cookies):

Update

Based on your new curl request...

  1. The amazon page (its js logic) makes XHR to its server for each product item. XHRs look like this: http://www.amazon.com/xa/dealcontent/v2/GetDealMetadata?nocache=1434445645152 not http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1 which is only the referer.

  2. A request for product item is POST, not GET.

  3. You probably got cookie from your browser and inserted it into the php curl header. Wrong. These cookie are of your browser session, not related to a session of your php server that will requests XHRs. Therefore for this use cookie jar, see the post.
  4. The POST's load is an object, should be formed with known structure. Form data: {"requestMetadata":{"marketplaceID":"ATVPDKIKX0DER","sessionID":"175-4567874-0146849","clientID":"goldbox"},"widgetContext":{"pageType":"GoldBox","subPageType":"AllDeals","deviceType":"pc","refRID":"1VFVJBKEYZT3DGWSANXQ","widgetID":"1969939662","slotName":"center-6"},"page":1,"dealsPerPage":8,"itemResponseSize":"NONE","queryProfile":{"featuredOnly":false,"dealTypes":["LIGHTNING_DEAL","BEST_DEAL"],"includedCategories":["283155","599858","154606011"],"excludedExtendedFilters":{"MARKETING_ID":["restrictedcontent"]}}}

See the developer tools picture:

  1. As Michael - sqlbot mentioned, you try to do an action that violates Amazon's terms of Use. But for the scrape technique's sake I still update my answer.