I am trying to scrape the hrefs of all the listings. I am fairly new to BeautifulSoup and have only done a bit of scraping before, but I can't for the life of me extract them. See my code below; the container has length zero when I run this script.
I also try to select the price (soup.findAll("span", {"class": "amount"})), but that doesn't come back either. Any advice most welcome :)
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
url = 'https://www.takealot.com/computers/laptops-10130'
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
respData = str(resp.read())
soup = BeautifulSoup(respData, 'html.parser')
container = soup.find_all("div", {"class": "p-data left"})
The page is rendered with JavaScript. There are several ways to render and scrape it.
I can scrape it with Selenium. First install Selenium (pip install selenium).
Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. You can use a headless version of Chrome ("Chrome Canary") if you are on Windows or Mac. A sketch of the approach follows.
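Here is a minimal sketch of the Selenium route. It assumes chromedriver is on your PATH, a reasonably recent Selenium (4.x), and that the "p-data left" class from your code is still what the site uses; adjust the selectors if the markup has changed.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")  # run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://www.takealot.com/computers/laptops-10130")

# Wait until at least one product block has been rendered by the page's JavaScript
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "p-data"))
)

# Hand the rendered DOM to BeautifulSoup and pull out the hrefs
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

containers = soup.find_all("div", {"class": "p-data left"})
for c in containers:
    a = c.find("a", href=True)
    if a:
        print(a["href"])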
Alternatively, use PyQt5 and its QtWebEngine module to render the page, for example:
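This is a sketch of the usual PyQt5 rendering pattern: subclass QWebEnginePage, load the URL, wait for loadFinished, and grab the HTML. It assumes PyQt5 and PyQtWebEngine are installed; the Render class name is just for illustration.

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from bs4 import BeautifulSoup

class Render(QWebEnginePage):
    """Load a URL, let the JavaScript run, and keep the rendered HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        super().__init__()
        self.html = None
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()  # blocks until quit() is called below

    def _on_load_finished(self, ok):
        # toHtml is asynchronous; the callback receives the rendered HTML
        self.toHtml(self._store_html)

    def _store_html(self, html):
        self.html = html
        self.app.quit()

page = Render("https://www.takealot.com/computers/laptops-10130")
soup = BeautifulSoup(page.html, "html.parser")
containers = soup.find_all("div", {"class": "p-data left"})
for c in containers:
    a = c.find("a", href=True)
    if a:
        print(a["href"])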
Alternatively, use dryscrape (Linux only):
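A minimal sketch with dryscrape, assuming it and its webkit-server backend are installed (pip install dryscrape):

import dryscrape
from bs4 import BeautifulSoup

# On a headless server you may also need: dryscrape.start_xvfb()
session = dryscrape.Session()
session.visit("https://www.takealot.com/computers/laptops-10130")

# session.body() returns the HTML after JavaScript has run
soup = BeautifulSoup(session.body(), "html.parser")
containers = soup.find_all("div", {"class": "p-data left"})
for c in containers:
    a = c.find("a", href=True)
    if a:
        print(a["href"])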
The output is the same in all cases:
However, when testing with your URL I found the results were not reproducible every time; occasionally containers came back empty even after the page had rendered.