I'm using BeautifulSoup to parse an HTML page from IMDb, and I would like to extract the poster image from the page. I've located the image tag through one of its attributes, but I don't know how to extract the data inside it.
Here's my code:
url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"):
    print("inside FOR")
    print(link.get('src'))
If I understand correctly, you are looking for the src of the image so that you can extract the image itself afterwards.
First, you need to find (using the browser inspector) where the image sits in the HTML. For example, in my particular case, where I was scraping soccer team shields, I needed:
Then you need to process the image. You have two options.
1st: using numpy:
shield = url_to_image(shield_url)
2nd: using the scikit-image library (which you will probably need to install):
Note: in this particular example only, I needed to add http: at the beginning of the URL.
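As a rough sketch of the lookup and the http: fix described above (not the original code): the markup, the shield class name, and both URLs are hypothetical stand-ins, and urljoin from the standard library is used here to supply the missing scheme instead of prepending http: by hand.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the class name is an assumption.
HTML = '<div class="team"><img class="shield" src="//cdn.example.com/shields/42.png"></div>'
PAGE_URL = "http://example.com/teams/42"  # hypothetical page address


def image_url(html, page_url):
    """Return the absolute URL of the first matching <img>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img", class_="shield")
    if img is None or not img.has_attr("src"):
        return None
    # A src starting with "//" has no scheme; urljoin borrows it from page_url.
    return urljoin(page_url, img["src"])


shield_url = image_url(HTML, PAGE_URL)
print(shield_url)
```

The resulting absolute URL is what you would then hand to whichever image loader you choose.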
Hope it helps!
You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have got the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:
I've changed id to film_id, because id() is a built-in function, and it's bad practice to mask those.
I believe your example is very close. You need to use findAll() instead of find(), and inside the loop you switch from src to link, whereas the loop variable has to keep the same name throughout. In the example below I switched it to tag.
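The two fixes suggested above can be sketched roughly as follows. This is an illustration under assumptions, not either answerer's original code: it parses a hypothetical inline snippet instead of fetching IMDb, and it uses find_all(), the BeautifulSoup 4 spelling of findAll().

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real IMDb page may be structured differently.
HTML = """
<html><body>
<img itemprop="image" src="http://example.com/poster.jpg">
<img itemprop="image" src="http://example.com/still.jpg">
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Fix 1: find() returns the first matching Tag (or None), so no loop is needed.
# Attributes are read with dictionary-style access.
first = soup.find(itemprop="image")
first_src = first["src"] if first is not None else None

# Fix 2: find_all() returns every match; the loop variable keeps one name (tag).
all_srcs = [tag.get("src") for tag in soup.find_all(itemprop="image")]

print(first_src)
print(all_srcs)
```

Note that tag["src"] raises KeyError when the attribute is missing, while tag.get("src") returns None, so the second form is the safer choice inside a loop over arbitrary matches.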
This code is working for me with BeautifulSoup4:
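A minimal sketch of what a BeautifulSoup 4 version could look like on Python 3, where urllib2 has become urllib.request. The helper names poster_src and fetch_poster_src are made up for illustration, and whether IMDb still marks the poster with itemprop="image" (or accepts requests without a browser-like User-Agent) is an assumption that may not hold.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def poster_src(html):
    """Return the src of the first itemprop="image" element, or None."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(itemprop="image")
    return tag.get("src") if tag is not None else None


def fetch_poster_src(film_id):
    """Download the title page for film_id and extract the poster src."""
    url = "http://www.imdb.com/title/tt%s/" % film_id
    with urlopen(url) as response:
        return poster_src(response.read())
```

Keeping the parsing separate from the download makes it possible to try poster_src on a saved copy of the page without touching the network.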