wget vs. Python's urlretrieve

I have a task to download several GB of data from a website. The data is in the form of .gz files, each about 45mb in size.

The easy way to get the files is to use "wget -r -np -A files url". This downloads the data recursively and mirrors the website. The download rate is very high: 4mb/sec.

But, just to play around, I was also using Python to build my own URL parser.

Downloading via Python's urlretrieve is damn slow, possibly 4 times as slow as wget. The download rate is 500kb/sec. I use HTMLParser for parsing the href tags.

I am not sure why this is happening. Are there any settings for this?

Thanks

Tags: python urllib2 wget

10 answers

太酷不给撩

#2 · 2019-02-02 01:23

import subprocess

myurl = 'http://some_server/data/'
subprocess.call(["wget", "-r", "-np", "-A", "files", myurl])

男人必须洒脱

#3 · 2019-02-02 01:23

There shouldn't be a difference really. All urlretrieve does is make a simple HTTP GET request. Have you taken out your data processing code and done a straight throughput comparison of wget vs. pure python?
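
One way to run that comparison is to time a bare download, with no parsing at all, and compute the rate. This is only a minimal sketch, assuming Python 3 (where the function lives at `urllib.request.urlretrieve`); the `measure` helper and its `fetch` parameter are hypothetical names, not part of the original question:

```python
import os
import tempfile
import time
from urllib.request import urlretrieve

def measure(url, fetch=urlretrieve):
    """Download url to a temporary file; return (bytes, seconds, MB per second)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        fetch(url, path)               # no HTML parsing, no progress hook
        elapsed = time.perf_counter() - start
        size = os.path.getsize(path)
    finally:
        os.remove(path)
    # Megabytes (10**6 bytes) per second; guard against a zero-length timing
    rate = size / 1e6 / elapsed if elapsed > 0 else float("inf")
    return size, elapsed, rate
```

Calling `measure(url)` on one of the .gz files and timing a plain `wget url` for the same file gives two figures in the same unit.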

Bombasti

#4 · 2019-02-02 01:25

urllib works for me as fast as wget. Try this code; it shows the progress as a percentage, just as wget does.

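import sys
from urllib.request import urlretrieve  # Python 3; in Python 2 this was urllib.urlretrieve

def reporthook(blocks, block_size, total_size):
    # urlretrieve calls this after each block; total_size is <= 0
    # when the server did not send a Content-Length header.
    if total_size > 0:
        percent = min(100, blocks * block_size * 100.0 / total_size)
        # '\r' returns to the start of the line so the progress updates in place
        sys.stdout.write("\r%5.1f%% of %d bytes" % (percent, total_size))
    else:
        sys.stdout.write("\r%d bytes" % (blocks * block_size))
    sys.stdout.flush()

for url in sys.argv[1:]:
    # Use the last path component of the URL as the local file name
    filename = url[url.rfind('/') + 1:]
    print(url, "->", filename)
    urlretrieve(url, filename, reporthook)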

一夜七次

#5 · 2019-02-02 01:26

Probably a unit math error on your part.

Just noticing that 500KB/s (kilobytes) is equal to 4Mb/s (megabits).
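
The arithmetic behind that observation, as a small sketch. It assumes decimal units (1 KB = 1000 bytes, 1 Mb = 1,000,000 bits), and the function name is only illustrative:

```python
def kilobytes_to_megabits(kb_per_sec):
    """Convert a rate in kilobytes/s to megabits/s (decimal units)."""
    bits_per_sec = kb_per_sec * 1000 * 8   # 1 byte = 8 bits
    return bits_per_sec / 1e6

print(kilobytes_to_megabits(500))  # 4.0
```

So if one tool reports bytes and the other reports bits, the two figures in the question describe the same speed.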

何必那么认真

#6 · 2019-02-02 01:29

As for the HTML parsing, the fastest and easiest option is probably lxml.

As for the HTTP requests themselves: httplib2 is very easy to use, and could possibly speed up downloads because it supports HTTP 1.1 keep-alive connections and gzip compression. There is also PycURL, which claims to be very fast (but is more difficult to use) and is built on libcurl, but I've never used it.

You could also try to download different files concurrently, but keep in mind that pushing your download times too far may not be very polite towards the website in question.
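
A minimal sketch of downloading a few files at a time, assuming Python 3 and the standard-library `concurrent.futures` module. The `download_all` and `local_name` helpers, the default of 3 workers, and the file-naming rule are illustrative choices, not something this answer specifies; a small pool keeps the load on the server modest:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def local_name(url):
    # Use the last path component of the URL as the local file name
    return url.rsplit("/", 1)[-1]

def download_all(urls, max_workers=3, fetch=urlretrieve):
    """Fetch urls with at most max_workers simultaneous downloads."""
    def job(url):
        filename = local_name(url)
        fetch(url, filename)
        return filename
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input urls
        return list(pool.map(job, urls))
```

Threads are enough here because the work is network-bound; the `fetch` parameter exists only so a different downloader can be swapped in.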

Sorry for the lack of hyperlinks, but SO tells me "sorry, new users can only post a maximum of one hyperlink"

我想做一个坏孩纸

#7 · 2019-02-02 01:32

Maybe you can wget and then inspect the data in Python?
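
For instance, once wget has saved the .gz files to disk, they can be read with the standard-library `gzip` module. A minimal sketch, assuming Python 3 and that the archives contain text; the `count_lines` helper is a hypothetical stand-in for whatever inspection is actually needed:

```python
import gzip

def count_lines(path):
    """Stream a .gz file and count its lines without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)
```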
