I want to implement a feature similar to this: http://www.tineye.com/parse?url=yahoo.com - allow users to upload images from any web page.
The main problem for me is that it takes too much time for web pages with a big number of images.
I'm doing this in Django (using curl or urllib) according to the following scheme:
Grab the HTML of the page (takes about 1 second for big pages):
file = urllib.urlopen(requested_url)
html_string = file.read()
Parse it with an HTML parser (BeautifulSoup), looking for img tags, and write the src of every image to a list (also takes about 1 second for big pages).
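A minimal sketch of this step, assuming BeautifulSoup 3 (bs4 would use find_all instead of findAll):

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

def get_image_srcs(html_string):
    # collect the src attribute of every <img> tag on the page
    soup = BeautifulSoup(html_string)
    return [img['src'] for img in soup.findAll('img') if img.get('src')]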
Check the sizes of all images in my list and, if they are big enough, return them in a JSON response (takes very long, about 15 seconds, when there are about 80 images on a web page). Here's the code of the function:
import urllib
from PIL import ImageFile

def get_image_size(uri):
    # Read only the first 1 KB and let PIL's incremental parser work out
    # the dimensions without downloading the whole image.
    file = urllib.urlopen(uri)
    try:
        p = ImageFile.Parser()
        data = file.read(1024)
        if not data:
            return None
        p.feed(data)
        if p.image:
            return p.image.size
        # not an image, or 1 KB was not enough to parse the header
        return None
    finally:
        file.close()
As you can see, I'm not loading the full image to get its size, only the first 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).
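For context, a minimal sketch of how the function is called, where image_srcs stands in for the list from the parsing step (the name is assumed, not from the code above):

# one blocking HTTP request per image URL
sizes = {}
for src in image_srcs:
    size = get_image_size(src)
    if size:
        sizes[src] = size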
So how can I make it work faster?
Maybe there is a way to avoid making a request for every single image?
Any help will be highly appreciated.
Thanks!
You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).
Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.
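A minimal sketch of the idea (not the original test code), reading Content-Length from the response headers with urllib2; note this gives the file size in bytes, not the pixel dimensions:

import urllib2

def get_content_length(uri):
    # only the headers are needed; Content-Length is the file size in bytes
    response = urllib2.urlopen(uri)
    try:
        length = response.info().getheader('Content-Length')
        return int(length) if length else None
    finally:
        response.close()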
I can think of a few optimisations:
Example of a HEAD request:
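A minimal sketch with Python 2's httplib, assuming plain-HTTP image URLs:

import httplib
import urlparse

def head_content_length(uri):
    # HEAD returns only the headers, so no image data is transferred
    parts = urlparse.urlparse(uri)
    conn = httplib.HTTPConnection(parts.netloc)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        length = response.getheader("content-length")
        return int(length) if length else None
    finally:
        conn.close()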