Extracting image src based on attribute with Beaut

I'm using BeautifulSoup to get a HTML page from IMDb, and I would like to extract the poster image from the page. I've got the image based on one of the attributes, but I don't know how to extract the data inside it.

Here's my code:

url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"): 
    print("inside FOR")
    print(link.get('src'))

标签： python html-parsing web-scraping beautifulsoup

3条回答

SAY GOODBYE

2楼-- · 2020-02-10 06:47

If I understand correctly you are looking for the src of the image, for the extraction of it after that.

In the first place you need to find (using the inspector) in which position in the HTML is the image. For example, in my particle case that I was scrapping soccer team shields, I needed:

m_url = 'http://www.marca.com/futbol/primera/equipos.html'
client = uOpen(m_url) 
page = client.read()
client.close()

page_soup = BS(page, 'html.parser')

teams = page_soup.findAll('li', {'id': 'nombreEquipo'})
for team in teams:
  name = team.h2.text
  shield_url = team.img['src']

Then, you need to process the image. You have to options.

1st: using numpy:

def url_to_image(url):
    '''
    Función para extraer una imagen de una URL
    '''
    resp = uOpen(url)
    image = np.asarray(bytearray(resp.read()), dtype='uint8')
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

shield = url_to_image(shield_url)

2nd Using scikit-image library (that you will probably need to install):

shield = io.imread('http:' + shield_url)

Note: Just in this particular example I needed to add http: at the beggining.

Hope it helps!

0人赞添加讨论(0) 举报

手持菜刀，她持情操

3楼-- · 2020-02-10 07:08

You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have got the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:

film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
link = soup.find(itemprop="image")
print(link["src"])
# output:
http://ia.media-imdb.com/images/M/MV5BMTg2ODMwNTY3NV5BMl5BanBnXkFtZTcwMzczNjEzMQ@@._V1_SY317_CR0,0,214,317_.jpg

I've changed id to film_id, because id() is a built-in function, and it's bad practice to mask those.

0人赞添加讨论(0) 举报

小情绪 Triste *

4楼-- · 2020-02-10 07:08

I believe your example is very close. You need to use findAll() instead of find() and when you iterate, you switch from src to link. In the below example I switched it to tag

This code is working for me with BeautifulSoup4:

url = 'http://www.imdb.com/title/tt%s/' % (id,)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print "before FOR"
for tag in soup.findAll(itemprop="image"): 
    print "inside FOR"
    print(tag['src'])

0人赞添加讨论(0) 举报

Extracting image src based on attribute with Beaut

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间