When I fetch a page using urllib2, I don't get the full page.
Here is the code, in Python:
import urllib2
import socket
from bs4 import BeautifulSoup

# give up on a request after 5 seconds
socket.setdefaulttimeout(5)

def get_page(url):
    """ loads a webpage into a string """
    src = ''
    req = urllib2.Request(url)
    try:
        response = urllib2.urlopen(req)
        src = response.read()
        response.close()
    except IOError:
        print 'can\'t open', url
    return src

def write_to_file(soup):
    ''' writes the soup to a file (I know I should use try/finally here) '''
    # write to a file so you can check whether you got the full page
    f = open('output', 'w')
    f.write(str(soup))
    f.close()

if __name__ == "__main__":
    # this is the page that I'm trying to get
    url = 'http://www.imdb.com/title/tt0118799/'
    src = get_page(url)
    soup = BeautifulSoup(src)
    write_to_file(soup)  # open the file and see what you get
    print "end"
I have been struggling to find the problem for a whole week! Why don't I get the full page?
Thanks for any help.
I had the same problem. I thought it was urllib2, but it turned out to be bs4: the parser it picked was silently truncating the document. Instead of letting BeautifulSoup choose a parser for you, try naming one explicitly.
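The original snippets in this answer are missing; presumably they contrasted the parser-less call with an explicit parser argument. A minimal sketch under that assumption (the sample `html` string is illustrative):

```python
from bs4 import BeautifulSoup

html = '<html><body><p>An example document.</p></body></html>'

# BeautifulSoup(html) lets bs4 pick whatever parser is installed
# (e.g. lxml), which can stop early on messy real-world HTML.
# Naming the parser makes the behavior predictable:
soup = BeautifulSoup(html, 'html.parser')   # Python's built-in parser
# soup = BeautifulSoup(html, 'html5lib')    # most lenient; separate install

text = soup.p.get_text()   # -> 'An example document.'
```

If the built-in parser still truncates the page, html5lib is usually the most forgiving of broken markup.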
Another thing to check: a single read() is not guaranteed to return the whole body. You may have to call read() in a loop until it returns an empty string, which indicates EOF.
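A sketch of such a read loop; `read_all` and `chunk_size` are illustrative names, and `io.BytesIO` stands in for the object returned by urlopen, since both expose the same read() contract:

```python
import io

def read_all(fileobj, chunk_size=8192):
    """Read a file-like object until read() returns an empty result (EOF)."""
    chunks = []
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:          # empty result means end of stream
            break
        chunks.append(chunk)
    return b''.join(chunks)

# demo: an in-memory stream larger than one chunk
data = b'x' * 20000
assert read_all(io.BytesIO(data), chunk_size=4096) == data
```

In get_page above you would call this on the response instead of a single response.read().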