How to scrape data from imdb business page?

I am making a project that requires data from imdb business page.I m using python. The data is stored between two tags like this :

Budget

$220,000,000 (estimated)

I want the numeric amount but have not been successful so far. Any suggestions.

回答1:

Take a look at Beautiful Soup, its a useful library for scraping. If you take a look at the source, the "Budget" is inside an h4 element, and the value is next in the DOM. This may not be the best example, but it works for your case:

import urllib
from bs4 import BeautifulSoup


page = urllib.urlopen('http://www.imdb.com/title/tt0118715/?ref_=fn_al_nm_1a')
soup = BeautifulSoup(page.read())
for h4 in soup.find_all('h4'):
    if "Budget:" in h4:
        print h4.next_sibling.strip()

# $15,000,000

回答2:

This is whole bunch of code (you can find your requirement here).
The below Python script will give you,
1) List of Top Box Office movies from IMDb
2) And also the List of Cast for each of them.

from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])

    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total

    movies = {}

    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com'+bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()

        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')

        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)

        movies[i] = mo

    return movies


if __name__ == '__main__':

    no_of_movies = raw_input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))

    for k,v in bo_movies.iteritems():
        print '#'+str(k+1)+'  '+v['title']+' ('+v['year']+')'
        print 'URL: '+v['url']
        print 'Weekend: '+v['weekend']
        print 'Gross: '+v['gross']
        print 'Weeks: '+v['weeks']
        print 'Cast: '+', '.join(v['cast'])
        print '\n'

Output (run in terminal):

parag@parag-innovate:~/python$ python imdb_bo_scraper.py 
Enter no. of Box office movies to display:3
#1  Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden


#2  Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski


#3  Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long

回答3:

Well you asked for python and you asked for a scraping solution.

But there is no need for python and no need to scrape anything because the budget figures are available in the business.list text file available at http://www.imdb.com/interfaces

回答4:

Try IMDbPY and its documentation. To install, just pip install imdbpy

from imdb import IMDb
ia = IMDb()
movie = ia.search_movie('The Untouchables')[0]
ia.update(movie)

#Lots of info for the movie from IMDB
movie.keys()

Though I'm not sure where to find specifically budget info