I am scraping the T-Mobile website for reviews of the Samsung Galaxy S9. I can create a Beautiful Soup object from the page's HTML, but I cannot fetch the review text, which sits inside a span class. I also need to iterate through the pages of reviews to collect all of them.
I have tried two scripts: one returns an error and the other returns an empty list. I also cannot find the particular span class I need anywhere in the soup object.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

tmo_ratings_s9 = []
req = Request('https://www.t-mobile.com/cell-phone/samsung-galaxy-s9',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
tmo_soup_s9 = BeautifulSoup(webpage, 'html.parser')

# Collect the text of every element carrying the review class.
for review in tmo_soup_s9.find_all(class_="BVRRReviewText"):
    text = review.get_text(strip=True)
    tmo_ratings_s9.append(text)  # was tmo_soup_s9.append(text), which never fills the list
print(tmo_ratings_s9)
############################################################################
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.t-mobile.com/cell-phone/samsung-galaxy-s9")
soup = BeautifulSoup(html, 'html.parser')
ratings = soup.find_all('div', class_='BVRRReviewTextParagraph BVRRReviewTextFirstParagraph BVRRReviewTextLastParagraph')
# find_all() returns a ResultSet; calling get_text() on it raises
# AttributeError, so iterate over the individual tags instead.
for rating in ratings:
    textofrep = rating.get_text().strip()
    tmo_ratings_s9.append(textofrep)
I expect to get the review text from all 8 pages on the webpage and store it in an HTML file.
Use Selenium or webscraper.io.
First, if you are using Google Chrome or Mozilla Firefox, press Ctrl+U on the page to open the page source. Check whether the review content appears anywhere in the source by searching for a few keywords from a review you can see on screen. If it is present, write an XPath for that data. If it is not, check the network section of the developer tools for any JSON requests sent while the page loads; if there are none, you will have to use Selenium.
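You can run the same check from Python. A minimal sketch, assuming the reviews use the BVRRReviewText class from your code: if the marker is absent from the raw HTML, the reviews are being injected by JavaScript after the page loads.

from urllib.request import Request, urlopen

req = Request('https://www.t-mobile.com/cell-phone/samsung-galaxy-s9',
              headers={'User-Agent': 'Mozilla/5.0'})
raw = urlopen(req).read().decode('utf-8', errors='ignore')
# If the class name (or a phrase visible in a review on screen) is not
# in the raw HTML, the content is loaded dynamically by script.
print('BVRRReviewText' in raw)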
In your case, send a request to this page: https://t-mobile.ugc.bazaarvoice.com/9060redes2-en_us/E4F08F7E-8C29-4420-BE87-9226A6C0509D/reviews.djs?format=embeddedhtml
This request is sent while the whole page loads, and its response carries the review content.
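Here is a rough sketch of pulling that endpoint directly. The exact shape of the response is an assumption (Bazaarvoice .djs endpoints typically embed the review HTML as an escaped string inside a script payload), so the unescaping step may need adjusting after you inspect the response.

import requests
from bs4 import BeautifulSoup

url = ('https://t-mobile.ugc.bazaarvoice.com/9060redes2-en_us/'
       'E4F08F7E-8C29-4420-BE87-9226A6C0509D/reviews.djs?format=embeddedhtml')
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# Assumption: the review markup is embedded as an escaped string inside
# the JavaScript payload; unescape it and parse it as ordinary HTML.
html = resp.text.encode('utf-8').decode('unicode_escape')
soup = BeautifulSoup(html, 'html.parser')
for review in soup.find_all(class_='BVRRReviewText'):
    print(review.get_text(strip=True))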
You are not getting the data because the content is loaded dynamically by a script. You can try Selenium, optionally together with Scrapy.
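A minimal Selenium sketch along those lines, including the pagination you asked about. The pager selector ('BVRRNextPage') is an assumption about Bazaarvoice's usual markup, so inspect the rendered page and adjust it; the sleeps are a crude stand-in for explicit waits.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

reviews = []
driver = webdriver.Chrome()
driver.get('https://www.t-mobile.com/cell-phone/samsung-galaxy-s9')
time.sleep(5)  # crude wait for the review widget to render

while True:
    for span in driver.find_elements(By.CLASS_NAME, 'BVRRReviewText'):
        reviews.append(span.text)
    # 'BVRRNextPage' is assumed from Bazaarvoice's usual pager markup.
    next_links = driver.find_elements(By.CSS_SELECTOR, '.BVRRNextPage a')
    if not next_links:
        break  # no next-page link: last page reached
    next_links[0].click()
    time.sleep(3)  # crude wait for the next page of reviews

driver.quit()

# Store the collected text in an HTML file, as the question asks.
with open('s9_reviews.html', 'w', encoding='utf-8') as f:
    f.write('\n'.join('<p>%s</p>' % r for r in reviews))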