I want to implement a feature similar to this: http://www.tineye.com/parse?url=yahoo.com - allow users to upload images from any web page.
The main problem for me is that it takes too much time for web pages with a big number of images.
I'm doing this in Django (using curl or urllib) according to the following scheme:
Grab the HTML of the page (takes about 1 second for big pages):
file = urllib.urlopen(requested_url)
html_string = file.read()
Parse it with an HTML parser (BeautifulSoup), looking for img tags, and write the src of every image to a list (also takes about 1 second for big pages).
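A minimal sketch of this step, assuming BeautifulSoup 3 (bs4 would use find_all instead of findAll):

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

def get_image_srcs(html_string):
    # collect the src attribute of every <img> tag on the page
    soup = BeautifulSoup(html_string)
    return [img['src'] for img in soup.findAll('img') if img.get('src')]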
Check the sizes of all images in my list and, if they are big enough, return them in a JSON response (takes very long, about 15 seconds, when there are about 80 images on a web page). Here's the code of the function:
import urllib
from PIL import ImageFile

def get_image_size(uri):
    # Read only the first 1 KB and let PIL's incremental parser work out
    # the dimensions without downloading the whole image.
    file = urllib.urlopen(uri)
    try:
        p = ImageFile.Parser()
        data = file.read(1024)
        if not data:
            return None
        p.feed(data)
        if p.image:
            return p.image.size
        # not an image, or 1 KB was not enough to parse the header
        return None
    finally:
        file.close()
As you can see, I'm not loading the full image to get its size, only the first 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).
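For context, a minimal sketch of how the function is called, where image_srcs stands in for the list from the parsing step (the name is assumed, not from the code above):

# one blocking HTTP request per image URL
sizes = {}
for src in image_srcs:
    size = get_image_size(src)
    if size:
        sizes[src] = size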
So how can I make it work faster?
Maybe there is a way to avoid making a request for every single image?
Any help will be highly appreciated.
Thanks!
You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).
Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.
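A minimal sketch of the idea (not the original test code), reading Content-Length from the response headers with urllib2; note this gives the file size in bytes, not the pixel dimensions:

import urllib2

def get_content_length(uri):
    # only the headers are needed; Content-Length is the file size in bytes
    response = urllib2.urlopen(uri)
    try:
        length = response.info().getheader('Content-Length')
        return int(length) if length else None
    finally:
        response.close()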
I can think of a few optimisations:
Example of a HEAD request:
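A minimal sketch with Python 2's httplib, assuming plain-HTTP image URLs:

import httplib
import urlparse

def head_content_length(uri):
    # HEAD returns only the headers, so no image data is transferred
    parts = urlparse.urlparse(uri)
    conn = httplib.HTTPConnection(parts.netloc)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        length = response.getheader("content-length")
        return int(length) if length else None
    finally:
        conn.close()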