Get text inside a span class of a particular div

I am scraping the T-Mobile website for reviews on Samsung Galaxy S9. I am able to create a Beautiful Soup object for the HTML code, but I cannot fetch the text of reviews which is present inside a span class, also need to iterate through the pages of reviews to collect all the reviews.

I have tried 2 codes, but one is returning an error and the other is returning an empty list. I also cannot find the particular span class I require in the soup object.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

tmo_ratings_s9 = []

req = Request('https://www.t-mobile.com/cell-phone/samsung-galaxy-s9', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
tmo_soup_s9 = BeautifulSoup(webpage, 'html.parser')
tmo_soup_s9.prettify()
for review in tmo_soup_s9.find_all(class_="BVRRReviewText"):
    text = review.span.get_text(strip=True)
    tmo_soup_s9.append(text)

print(tmo_ratings_s9)


############################################################################

from urllib.request import urlopen
html = urlopen("https://www.t-mobile.com/cell-phone/samsung-galaxy-s9")

soup=BeautifulSoup(html)

ratings = soup.find_all('div', class_='BVRRReviewTextParagraph BVRRReviewTextFirstParagraph BVRRReviewTextLastParagraph')     
textofrep = ratings.get_text().strip()
tmo_ratings_s9.append(textofrep)

I expect to get the text of the reviews from all the 8 pages on the webpage and store them in an HTML file.

标签： python html web-scraping

3条回答

冷血范

2楼-- · 2020-05-09 19:38

use selenium or webscraper.io

https://www.webscraper.io/

https://www.seleniumhq.org/docs/01_introducing_selenium.jsp

0人赞添加讨论(0) 举报

太酷不给撩

3楼-- · 2020-05-09 19:53

first if you are using google chrome or mozilla firefox please press ctrl+u from the page, then you will go to the page source. Check if the review content is present anywhere in the source by searching some keywords. If present write the xpath of that data, if not present, check the network section for any json requests sending while the page loads, if not present you will have to use selenium.

In your case send request to this page https://t-mobile.ugc.bazaarvoice.com/9060redes2-en_us/E4F08F7E-8C29-4420-BE87-9226A6C0509D/reviews.djs?format=embeddedhtml

This is a json request send while loading the whole page.

0人赞添加讨论(0) 举报

▲ chillily

4楼-- · 2020-05-09 19:54

You are not getting the data due to dynamic content loading through script. You can try selenium along with scrapy.

import scrapy
from selenium import webdriver
from scrapy.http import HtmlResponse

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['t-mobile.com']
    start_urls = ['https://www.t-mobile.com/cell-phone/samsung-galaxy-s9']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        body = str.encode(self.driver.page_source)
        self.parse_response(HtmlResponse(self.driver.current_url, body=body, encoding='utf-8'))

    def parse_response(self, response):
        tmo_ratings_s9 = []
        for review in response.css('#reviews div.BVRRContentReview'):
            text = review.css('.BVRRReviewText::text').get().strip()
            tmo_ratings_s9.append(text)

        print(tmo_ratings_s9)

    def spider_closed(self, spider, reason):
        self.driver.close()

0人赞添加讨论(0) 举报

Get text inside a span class of a particular div

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间