Wget: Skip download if file already exists?

Posted 2019-04-07 21:54

Answers to "Skip download if files exist in wget?" say to use -nc, or --no-clobber, but -nc doesn't prevent wget from sending the HTTP request and then downloading the file. It just does nothing after the download if the file has already been fully retrieved. Is there any way to avoid making the HTTP request at all if the file already exists?

I installed wget 1.16.3 with Homebrew. After I ran the command below, wget printed something like "making HTTP request" for each file that already existed, appeared to download it, and then printed something like "file already retrieved, nothing to do".

wget --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12' \
     --tries=1 \
     --no-clobber \
     --continue \
     --wait=0.3 \
     --random-wait \
     --adjust-extension \
     --load-cookies cookies.txt \
     --save-cookies cookies.txt \
     --keep-session-cookies \
         --recursive \
         --level=inf \
         --convert-links \
         --page-requisites \
         --reject=edit,logout,rate \
         --domains=example.com,s3.amazonaws.com \
         --span-hosts \
         --exclude-directories=/admin \
     http://example.com/

Tags: wget
2 answers
疯言疯语
#2 · 2019-04-07 22:25

It appears you are using incompatible options. I get the following warning with wget 1.16 on Linux:

$ wget --no-clobber --convert-links http://example.com
Both --no-clobber and --convert-links were specified, only --convert-links will be used.
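
One way to act on that warning is to drop --convert-links so that --no-clobber stays in effect. The following is only a sketch: the function name mirror_no_clobber is made up for illustration, the remaining flags are taken from the question, and --continue is left out on the assumption that resuming has to contact the server to compare file sizes.

```shell
# Hypothetical wrapper: recursive fetch that keeps --no-clobber effective.
# --convert-links is omitted because wget reports that it overrides --no-clobber.
# --continue is omitted on the assumption that it forces a request per file.
mirror_no_clobber() {
    wget --tries=1 \
         --no-clobber \
         --wait=0.3 \
         --random-wait \
         --recursive \
         --level=inf \
         --page-requisites \
         "$@"
}
```

It would be called as mirror_no_clobber http://example.com/, with any extra options from the original command appended as further arguments.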
等我变得足够好
#3 · 2019-04-07 22:36

The -nc option does what you're asking for, at least in wget 1.19.1.


On my server, I have a file called index.html which contains links to a.html and b.html.

$ wget -r -nc http://127.0.0.1:8000/

Server logs show this:

127.0.0.1 - - [23/Mar/2017 17:51:25] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2017 17:51:25] "GET /robots.txt HTTP/1.1" 404 -
127.0.0.1 - - [23/Mar/2017 17:51:25] "GET /a.html HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2017 17:51:25] "GET /b.html HTTP/1.1" 200 -

Now I remove b.html and run it again:

$ rm 127.0.0.1\:8000/b.html
$ wget -r -nc http://127.0.0.1:8000/

Server logs show this:

127.0.0.1 - - [23/Mar/2017 17:51:38] "GET /robots.txt HTTP/1.1" 404 -
127.0.0.1 - - [23/Mar/2017 17:51:38] "GET /b.html HTTP/1.1" 200 -

As you can see, only a request for b.html was made.
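
For a single known URL, the request can also be avoided outside wget by testing for the local file first. This is a minimal sketch, not part of the test above: fetch_if_missing and the .part suffix are hypothetical names, and it assumes the local file name is the last path component of the URL unless a destination is given.

```shell
# Hypothetical helper: only invoke wget when the destination file is absent.
# Usage: fetch_if_missing URL [DEST]
fetch_if_missing() {
    url=$1
    # Assumption: default destination is the last path component of the URL.
    dest=${2:-$(basename "$url")}

    if [ -e "$dest" ]; then
        # File is already present, so no HTTP request is made at all.
        echo "skip: $dest already exists"
        return 0
    fi

    # Download to a temporary name so a failed transfer does not leave
    # a partial file that would be skipped on the next run.
    if wget --tries=1 --output-document="$dest.part" "$url"; then
        mv "$dest.part" "$dest"
    else
        rm -f "$dest.part"
        return 1
    fi
}
```

For example, fetch_if_missing http://127.0.0.1:8000/b.html would contact the server only when b.html is missing from the current directory.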
