Scrape a tag from Amazon

Posted 2020-04-01 02:59

I am trying to scrape a tag from Amazon.

On this page I scrape every product title and its price. The scraped data looks like this:

Title    Price
 A        169.99
 B        79.55
 C        39.96
 D        19.90       
 E        34.99        

I would also like to scrape the "Sponsored" tag (marked in yellow in the screenshot below; the blue areas hide the brand names).

[Screenshot: example of the desired tag to scrape]

The desired output:

Title    Price       Sponsored_Tag
 A        169.99      Yes
 B        79.55       Yes
 C        39.96       No
 D        19.90       No
 E        34.99       No 

What have I tried?

I used Python and Scrapy. In the code below, the item field "test" is where I tried to capture the "Sponsored" tag in several ways; all of them failed. Ideally the solution would be a change to the code below, because I use this code for other processes as well.

Many thanks!

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
#import re

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    start_urls = [
            "https://www.amazon.com/s?k=shaver+for+men&i=beauty&ref=nb_sb_noss_2"]

    custom_settings = {
            'FEED_URI' : 'Asin_Titles.json',
            'FEED_FORMAT' : 'json'
            }
    def parse(self, response):
        for product in response.css('.s-result-item'): 
            item = AmazonItem()

            #item['test'] = product.css('.s-info-icon').get()
            #item['test'] = product.css('.s-min-height-extra-large').get()
            item['test'] = product.css('.a-spacing-micro').get()

            yield item


class AmazonItem(scrapy.Item):
    test = scrapy.Field()


configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(AmazonProductSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Update: this is what we have in "product"

It looks like I didn't capture the "Sponsored" tag here either...

"items": "<div data-asin=\"B01859QHJU\" data-index=\"0\" class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 s-result-item sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n    \n\n\n\n\n\n\n\n\n<div class=\"s-expand-height s-include-content-margin s-border-bottom\">\n<div class=\"a-section a-spacing-medium\">\n\n\n<div class=\"sg-row\">\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        <div class=\"a-section a-spacing-micro s-min-height-extra-large\">\n            \n                \n\n\n<span aria-label=\"Amazon's Choice\">\n    \n\n\n\n\n<a class=\"a-link-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU/ref=ice_ac_b_dpb\">\n    \n        \n            \n                \n\n\n\n\n<span data-component-type=\"s-status-badge-component\" data-component-props='{\"badgeType\":\"amazons-choice\",\"asin\":\"B01859QHJU\"}' class=\"rush-component\">\n  <div class=\"a-row a-badge-region\"><span id=\"B01859QHJU\" class=\"a-badge\" aria-labelledby=\"B01859QHJU-label B01859QHJU-supplementary\" data-a-badge-supplementary-position=\"right\" tabindex=\"0\" data-a-badge-type=\"status\"><span id=\"B01859QHJU-label\" class=\"a-badge-label\" data-a-badge-color=\"sx-gulfstream\" aria-hidden=\"true\"><span class=\"a-badge-label-inner a-text-ellipsis\">\n    \n      <span class=\"a-badge-text\" data-a-badge-color=\"sx-cloud\">Amazon's </span>\n    \n      <span class=\"a-badge-text\" data-a-badge-color=\"ac-orange\">Choice</span>\n    \n  </span></span><span id=\"B01859QHJU-supplementary\" class=\"a-badge-supplementary-text a-text-ellipsis\" aria-hidden=\"true\">for electric razor</span></span></div>\n</span>\n\n            \n        \n        \n    \n</a>\n\n</span>\n\n            \n        </div>\n    </div></div>\n</div>\n\n<div class=\"sg-row\">\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 
sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n        <div class=\"a-section a-spacing-none\">\n            \n\n\n\n\n\n<span data-component-type=\"s-product-image\" class=\"rush-component\">\n    \n    <a class=\"a-link-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU\">\n        <div class=\"a-section aok-relative s-image-square-aspect\">\n            \n                \n                    <img src=\"https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL320_.jpg\" class=\"s-image\" alt=\"Philips Norelco Electric Shaver 2100, S1560/81\" srcset=\"https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL320_.jpg 1x, https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL480_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL640_QL65_.jpg 2x, https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL800_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/61JJ1+ygJfL._AC_UL960_QL65_.jpg 3x\" data-image-index=\"0\" data-image-load=\"\" data-image-latency=\"s-product-image\" data-image-source-density=\"1\" onload=\"window.uet &amp;&amp; uet('cf')\">\n                \n                \n            \n        </div>\n    </a>\n</span>\n\n        </div>\n        \n  </div></div>\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n        <div class=\"a-section a-spacing-none a-spacing-top-small\">\n            \n\n\n\n\n<h2 class=\"a-size-mini a-spacing-none a-color-base s-line-clamp-4\">\n    \n    \n        \n\n\n\n\n<a class=\"a-link-normal a-text-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU\">\n    \n        \n            \n                <span class=\"a-size-base-plus a-color-base a-text-normal\">Philips Norelco Electric Shaver 2100, S1560/81</span>\n            \n        \n        \n    \n</a>\n\n    \n</h2>\n\n        
</div>\n        \n            <div class=\"a-section a-spacing-none a-spacing-top-micro\">\n                <div class=\"a-row a-size-small\">\n\n\n<span aria-label=\"4.1 out of 5 stars\">\n    \n\n\n\n\n\n\n    \n        <span class=\"a-declarative\" data-action=\"a-popover\" data-a-popover='{\"max-width\":\"700\",\"closeButton\":false,\"position\":\"triggerBottom\",\"url\":\"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&amp;asin=B01859QHJU&amp;ref=acr_search__popover&amp;contextId=search\"}'>\n            \n            <a href=\"javascript:void(0)\" class=\"a-popover-trigger a-declarative\"><i class=\"a-icon a-icon-star-small a-star-small-4 aok-align-bottom\"><span class=\"a-icon-alt\">4.1 out of 5 stars</span></i><i class=\"a-icon a-icon-popover\"></i></a>\n        </span>\n    \n    \n\n\n</span>\n\n\n\n<span aria-label=\"3,260\">\n    \n\n\n\n\n<a class=\"a-link-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU#customerReviews\">\n    \n        \n            \n                <span class=\"a-size-base\">3,260</span>\n            \n        \n        \n    \n</a>\n\n</span>\n</div>\n            </div>\n        \n  </div></div>\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n        \n            <div class=\"a-section a-spacing-none a-spacing-top-small\">\n                <div class=\"a-row a-size-base a-color-base\"><div class=\"a-row\">\n\n\n\n\n<a class=\"a-size-base a-link-normal s-no-hover a-text-normal\" href=\"/Philips-Norelco-Electric-S1560-81/dp/B01859QHJU\">\n    \n        \n            \n                <span class=\"a-price\" data-a-size=\"l\" data-a-color=\"base\"><span class=\"a-offscreen\">$39.96</span><span aria-hidden=\"true\"><span class=\"a-price-symbol\">$</span><span class=\"a-price-whole\">39<span class=\"a-price-decimal\">.</span></span><span 
class=\"a-price-fraction\">96</span></span></span>\n            \n        \n        \n    \n</a>\n</div></div>\n            </div>\n        \n        \n            <div class=\"a-section a-spacing-none a-spacing-top-micro\">\n                <div class=\"a-row a-size-base a-color-secondary s-align-children-center\"><div class=\"a-row s-align-children-center\">\n\n\n\n\n<span class=\"aok-inline-block s-image-logo-view\">\n  <span class=\"aok-relative s-icon-text-medium s-prime\">\n    <i class=\"a-icon a-icon-prime a-icon-medium\" role=\"img\" aria-label=\"Amazon Prime\"></i>\n  </span>\n  <span>\n    \n  </span>\n</span>\n\n\n\n<span aria-label=\"Get it as soon as Tomorrow, Jul 11\">\n    <span>Get it as soon as </span><span class=\"a-text-bold\">Tomorrow, Jul 11</span>\n</span>\n</div><div class=\"a-row\">\n\n\n<span aria-label=\"FREE Shipping by Amazon\">\n    <span>FREE Shipping by Amazon</span>\n</span>\n</div></div>\n            </div>\n        \n        \n        \n        \n        \n  </div></div>\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n  </div></div>\n  <div class=\"sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32\"><div class=\"sg-col-inner\">\n        \n        \n  </div></div>\n</div>\n</div>\n</div>\n\n</div></div>",

1 answer
放荡不羁爱自由
#2 · 2020-04-01 04:02

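You can use the CSS selector :contains("Sponsored") to test whether a result is an ad. It is a non-standard selector that BeautifulSoup supports through Soup Sieve (newer releases spell it :-soup-contains):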

import requests
from bs4 import BeautifulSoup
from textwrap import shorten

url = 'https://www.amazon.com/s?k=shaver+for+men&i=beauty&ref=nb_sb_noss_2'
headers = {'User-Agent': 'Mozilla/5.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

print('{: ^55}{: ^12}{: ^13}'.format('Title', 'Price', 'Sponsored_Tag'))
# Every search result sits in a <div> that carries a data-asin attribute.
for div in soup.select('div[data-asin]'):
    title = div.select_one('span.a-text-normal').text
    # Some results have no price element; print '-' for those.
    price_tag = div.select_one('.a-offscreen')
    price = price_tag.text if price_tag else '-'
    # A <span> containing the text "Sponsored" marks the result as an ad.
    sponsored = 'Yes' if div.select_one('span:contains("Sponsored")') else 'No'
    print('{: <55}{: ^12}{: ^13}'.format(shorten(title, 55), price, sponsored))

Prints:

                         Title                            Price    Sponsored_Tag
Braun Series 7 Electric Shaver for Men 7893s, Wet [...]  $169.99        Yes     
Philips Norelco Shaver 4500 (Model AT830/46) [...]        $79.95        Yes     
Philips Norelco Electric Shaver 2100, S1560/81            $39.96        No      
Philips Norelco Multigroom Series 3000, [...]             $19.90        No      
5 In 1 Rechargeable Electric Shaver Razor Men [...]       $34.99        No      
Remington F5-5800 Foil Shaver, Men's Electric [...]       $42.94        No      
Philips Norelco OneBlade hybrid electric trimmer [...]    $34.95        No      
Remington PG6025 All-in-1 Lithium Powered [...]           $19.99        No      
Electric Shaver for Men Waterproof, DynaBliss 3D [...]    $39.96        No      
Panasonic Electric Shaver and Trimmer for Men, [...]      $99.99        No      
Men’s 5-in-1 Electric Shaver & Grooming Kit: [...]        $54.99        No      
Philips Norelco Electric Shaver 8900, Wet & Dry [...]    $149.99        No      
Braun Series 3 ProSkin 3040s Electric Razor for [...]     $69.94        No      
Electric Shaver for Men Wet and Dry Waterproof, [...]     $29.99        No      
Philips Norelco Shaver 4500 (Model AT830/46) [...]        $79.95        No      
Electric Shaver Razor for Men 5 in 1 Rotary [...]         $39.99        No      
MOOSOO M Electric Razor for Men Electric Shaver [...]     $42.99        No      
Panasonic Electric Shaver and Trimmer for Men [...]       $69.99        No      
Wahl Professional 5-Star Series Rechargeable [...]        $79.95        No      
Philips Norelco Multigroom Series 7000, [...]             $54.95        No      
Philips Norelco Electric Shaver 6800, S6880/81, [...]       -           No      
Panasonic Arc5 Electric Razor, Men's 5-Blade [...]          -           No      
SweetLF 3D Rechargeable 100% Waterproof IPX7 [...]        $36.99        No      
Men’s 5-in-1 Electric Shaver & Grooming Kit by [...]      $49.99        No      
Panasonic Hybrid Wet Dry Shaver, Trimmer & [...]          $79.99        No      
Andis 17150 Profoil Lithium                               $50.45        No      
Philips Norelco OneBlade hybrid electric trimmer [...]    $34.95        Yes     
Philips Norelco 9000 Prestige Electric Shaver [...]      $277.49        Yes     
Braun Electric Razor for Men / Electric Shaver, [...]     $49.94        Yes     
Gillette Fusion5 Proglide Men's Razor Handle + 4 [...]    $21.99        Yes     
Electric Razor, Electric Shavers for Men, 4 in 1 [...]    $28.99        No      
Philips Norelco Shaver 4100 (Model AT810/46)              $59.97        No      
Electric Razor for Men,FLYCO Electric Shavers 2 [...]     $24.99        No      
Panasonic Electric Travel Shaver, ES3831K                 $14.65        No      
Electric Razor Shaver for Men, 4 in 1 Dry Wet [...]       $29.99        No      
Braun Series 3 Shave&Style 3010BT 3-in-1 Electric [...]   $59.94        No      
Braun Electric Razor for Men / Electric Shaver, [...]     $49.94        No      
Braun Series 3 310s Electric Razor for Men, [...]         $39.94        No      
Max-Tcare Men's Electric Shaver - Corded and [...]        $37.96        No      
Wahl Speed Shave Rechargeable Lithium Ion Wet/Dry [...]   $32.40        No      
Electric Shaver and Beard Trimmer - 5 in 1 Multi- [...]   $27.98        No      
Panasonic ES-LA63-S Arc4 Men's Electric Razor, [...]     $101.95        No      
Philips Norelco Corded Electric Shaver 1100, [...]        $29.99        No      
INSMART Electric Shaver for men, Waterproof [...]         $33.99        No      
Philips Norelco Electric Shaver 5570 Wet & Dry, [...]    $114.98        No      
HATTEKER Electric Shaver For Men Rotary Shaver [...]      $32.99        No      
Philips Norelco Electric shaver 3100, S3310/81 [...]      $49.95        No      
Dee Banna 5D Wet Dry Electric Rotary Shaver Men's [...]   $24.99        No      
Men’s 5-in-1 Electric Shaver & Grooming Kit Hair [...]    $35.99        No      
(Updated Version) Electric Shaver for Men, [...]          $29.99        No      
MANGROOMER Ultimate Pro Back Shaver with 2 Shock [...]    $49.99        No      
Philips Norelco Bodygroom Series 7000, BG7030/49, [...]   $69.95        No      
Electric Razor for Men 4 in 1 Rotary Shavers [...]        $33.99        No      
Wahl Clipper Stainless Steel Lithium Ion Plus [...]       $59.97        No      
Philips Norelco Electric Shaver 8900, Wet & Dry [...]    $149.99        Yes     
Max-Tcare Men's Electric Shaver - Corded and [...]        $35.96        Yes     
Electric Razor for Men Wet & Dry Cordless Foil [...]      $42.99        Yes     
Electric Shaver for Men Waterproof, DynaBliss 3D [...]    $39.96        Yes     
Panasonic Electric Shaver and Trimmer for Men, [...]      $99.99        Yes     
Men's 5-in-1 Electric Shaver Razor & Grooming Kit [...]   $31.99        Yes     
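
Since the question asks for a change to the Scrapy spider, the same text test could probably be carried over like this. This is only a sketch and has not been run against the live site: build_row is a hypothetical helper, and its three arguments are assumed to be strings the spider has already extracted for one result. The commented wiring reuses the selectors from the code above with Scrapy's ::text, .get() and .getall(); Amazon's markup may differ or change.

```python
def build_row(title, price_text, span_texts):
    """Build one output row from strings already extracted for a single result.

    Assumed inputs (hypothetical, not taken from the question's code):
    title       -- product title text
    price_text  -- price as displayed, e.g. '$169.99', or None when missing
    span_texts  -- text of every <span> inside the result
    """
    # Strip the currency symbol and thousands separators before converting.
    price = float(price_text.replace('$', '').replace(',', '')) if price_text else None
    # Same idea as span:contains("Sponsored"): look for the label in any span text.
    sponsored = 'Yes' if any('Sponsored' in text for text in span_texts) else 'No'
    return {'Title': title.strip(), 'Price': price, 'Sponsored_Tag': sponsored}


# Hypothetical wiring inside AmazonProductSpider.parse:
#     yield build_row(
#         product.css('span.a-text-normal::text').get(default=''),
#         product.css('.a-offscreen::text').get(),
#         product.css('span::text').getall(),
#     )

print(build_row(' Braun Series 7 ', '$169.99', ['\n Sponsored\n', 'Braun Series 7']))
print(build_row('Philips Norelco 2100', '$39.96', ["Amazon's ", 'Choice']))
```

Yielding a plain dict avoids having to declare the three fields on AmazonItem; if the item class is kept, it would need matching scrapy.Field() entries. Note that the product dumped in the question's update appears to be the Philips Norelco 2100, which is listed as "No" above, so the missing label in that dump is expected.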