Extracting image src based on attribute with BeautifulSoup

Posted 2020-02-10 07:04

Question:

I'm using BeautifulSoup to get an HTML page from IMDb, and I would like to extract the poster image from the page. I've located the image element based on one of its attributes, but I don't know how to extract the data inside it.

Here's my code:

url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"): 
    print("inside FOR")
    print(link.get('src'))

Answer 1:

You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have got the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:

import urllib2
from bs4 import BeautifulSoup

film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
link = soup.find(itemprop="image")
print(link["src"])
# output:
# http://ia.media-imdb.com/images/M/MV5BMTg2ODMwNTY3NV5BMl5BanBnXkFtZTcwMzczNjEzMQ@@._V1_SY317_CR0,0,214,317_.jpg

I've changed id to film_id, because id() is a built-in function, and it's bad practice to shadow built-ins.
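
If you are running Python 3 rather than Python 2, urllib2 no longer exists and its functionality lives in urllib.request; a minimal sketch of the same lookup under that assumption:

# Python 3 sketch of the same lookup; urllib2 was merged into urllib.request.
from urllib.request import urlopen
from bs4 import BeautifulSoup

film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id,)
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
link = soup.find(itemprop="image")  # first matching element, or None if nothing matches
if link is not None:
    print(link["src"])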



Answer 2:

I believe your example is very close. You need to use findAll() instead of find() (so you get a list to iterate over), and in your loop you name the variable src but then reference link. In the example below I switched it to tag.

This code is working for me with BeautifulSoup4:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/title/tt%s/' % (id,)  # id is the IMDb title id string from the question
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for tag in soup.findAll(itemprop="image"):
    print("inside FOR")
    print(tag['src'])
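
As a side note, in BeautifulSoup 4 the preferred spelling is find_all() (findAll still works as an alias), and tag.get('src') returns None instead of raising a KeyError when the attribute is missing. A small self-contained sketch along those lines, reusing the example id from the first answer:

import urllib2
from bs4 import BeautifulSoup

film_id = '0423409'  # example IMDb id borrowed from the first answer
url = 'http://www.imdb.com/title/tt%s/' % (film_id,)
soup = BeautifulSoup(urllib2.urlopen(url).read())
# find_all is the BeautifulSoup 4 name for findAll; .get() is safe when src is missing
for tag in soup.find_all(itemprop="image"):
    src = tag.get('src')
    if src:
        print(src)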


Answer 3:

If I understand correctly, you are looking for the src of the image so that you can extract the image itself afterwards.

First, you need to find (using the browser inspector) where the image sits in the HTML. For example, in my particular case, where I was scraping soccer team crests, I needed:

from urllib.request import urlopen as uOpen  # assumed imports behind the uOpen and BS aliases
from bs4 import BeautifulSoup as BS

m_url = 'http://www.marca.com/futbol/primera/equipos.html'
client = uOpen(m_url)
page = client.read()
client.close()

page_soup = BS(page, 'html.parser')

teams = page_soup.findAll('li', {'id': 'nombreEquipo'})
for team in teams:
    name = team.h2.text
    shield_url = team.img['src']
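
Note that, as written, the loop overwrites name and shield_url on every pass, so only the last team survives; if you want to keep all of them, a small sketch (continuing from the snippet above) that collects the pairs in a dict:

# Collect every team's crest URL instead of keeping only the last one
shield_urls = {}
for team in teams:
    shield_urls[team.h2.text] = team.img['src']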

Then, you need to process the image. You have two options.

1st: using NumPy together with OpenCV:

import numpy as np
import cv2

def url_to_image(url):
    '''
    Fetch an image from a URL and decode it into an array.
    '''
    resp = uOpen(url)
    image = np.asarray(bytearray(resp.read()), dtype='uint8')
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

shield = url_to_image(shield_url)
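
The decoded result is a plain NumPy array in OpenCV's BGR channel order, so you can inspect or save it directly; a short follow-up sketch (the output file name is hypothetical):

print(shield.shape)                # e.g. (height, width, 3)
cv2.imwrite('shield.png', shield)  # write the crest to disk; 'shield.png' is a placeholder name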

2nd: using the scikit-image library (which you will probably need to install):

from skimage import io
shield = io.imread('http:' + shield_url)

Note: just in this particular example I needed to add http: at the beginning.
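
If you then want to keep the crest on disk, scikit-image can write it back out as well; a minimal follow-up sketch (the output file name is hypothetical):

io.imsave('shield.png', shield)  # save the downloaded crest; 'shield.png' is a placeholder name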

Hope it helps!