mirror http website, excluding certain files

Posted 2019-03-27 08:45

Question:

I'd like to mirror a simple password-protected web portal to some data that I want to keep mirrored and up to date. Essentially, the website is just a directory listing with data organised into folders, and I don't care about keeping HTML files and other formatting elements. However, some file types are too large to download, so I want to ignore these.

Using wget -m with the -R/--reject flag nearly does what I want, except that every file gets downloaded first and is then deleted if it matches the -R pattern.

Here's how I'm using wget:

wget --http-user userName --http-password password -R 'index.html,*tiff,*bam,*bai' -m http://web.server.org/

Which produces output like this, confirming that an excluded file (index.html) (a) gets downloaded, and (b) then gets deleted:

...
--2012-05-23 09:38:38-- http://web.server.org/folder/
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 401 Authorization Required
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2677 (2.6K) [text/html]
Saving to: `web.server.org/folder/index.html'
100%[======================================================================================================================>] 2,677 --.-K/s in 0s

Last-modified header missing -- time-stamps turned off.
2012-05-23 09:38:39 (328 MB/s) - `web.server.org/folder/index.html' saved [2677/2677]

Removing web.server.org/folder/index.html since it should be rejected.

...

Is there a way to force wget to reject the file before downloading it?
Is there an alternative that I should consider?

Also, why do I get a 401 Authorization Required error for every downloaded file, despite supplying a username and password? It's as if wget tries to connect unauthenticated every time before trying the username and password.

Thanks, Mark

Answer 1:

Pavuk (http://www.pavuk.org) looked like a promising alternative: it can mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 segfaults or dies randomly in the middle of long transfers and does not appear to be actively developed (this version was built in Nov 2008).

FYI, here's how I was using it:
pavuk -mode mirror -force_reget -preserve_time -progress -Robots -auth_scheme 3 -auth_name x -auth_passwd x -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old -cdir /path/to/root -subdir /path/to/root -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' http://web.server.org/folder 2>&1 | tee pavuk-date.log

In the end, wget --exclude-directories did the trick:

wget --mirror --continue --progress=dot:mega --no-parent \
--no-host-directories --cut-dirs=1 \
--http-user x --http-password x \
--exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
--directory-prefix /path/to/local/mirror \
http://my.server.org/folder

Since the --exclude-directories wildcards don't span '/', you need to write the patterns quite specifically to avoid downloading entire folders.
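Because `*` stops at `/`, one way to cover a directory that may sit at more than one depth is to list one pattern per depth, separated by commas. The following is only a sketch: the function name `mirror_without_large_data` is hypothetical, the paths and credentials are the placeholders from the command above, and whether the second depth is needed depends on a server layout that is not shown here.

```shell
# Hypothetical wrapper around the mirror command above.
mirror_without_large_data() {
    # One pattern per directory depth, because '*' does not match '/'.
    excluded='folder/*/folder_containing_large_data*,folder/*/*/folder_containing_large_data*'
    wget --mirror --no-parent \
         --http-user x --http-password x \
         --exclude-directories="$excluded" \
         --reject 'index.html*' \
         --directory-prefix /path/to/local/mirror \
         http://my.server.org/folder
}
```

Defining the function downloads nothing; calling `mirror_without_large_data` starts the mirror.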

Mark



Answer 2:

The --reject 'pattern' parameter actually worked for me with wget 1.14.

For example:

wget --reject rpm http://somerpmmirror.org/site/

None of the *.rpm files were downloaded, only the indexes.

Warning: File patterns can be unintentionally expanded by bash if they match a file located in the working directory. Use quotes to avoid that:

touch blahblah.rpm
# working
wget -R '*.rpm' ....
# working
wget -R "*.rpm" ....
# not working: bash expands *.rpm to blahblah.rpm before wget sees it
wget -R *.rpm ....
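The expansion can be observed without running wget at all. This sketch uses a hypothetical helper, `show_args`, that prints the arguments a command would receive; it works in a temporary directory so nothing is left behind.

```shell
# Work in a throwaway directory containing one matching file.
old_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch blahblah.rpm

# Hypothetical helper: print each argument a command would receive.
show_args() { printf '<%s>\n' "$@"; }

quoted=$(show_args -R '*.rpm')   # the pattern reaches the command intact
unquoted=$(show_args -R *.rpm)   # the shell replaces the pattern first

echo "quoted:";   echo "$quoted"
echo "unquoted:"; echo "$unquoted"

# Clean up.
cd "$old_dir" && rm -rf "$demo_dir"
```

The quoted call prints `<*.rpm>` as its second argument, while the unquoted call prints `<blahblah.rpm>`, so wget would only ever reject that one file name.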


Answer 3:

Not possible with wget: http://linuxgazette.net/160/misc/lg/how_to_make_wget_exclude_a_particular_link_when_mirroring.html

Well, I am not sure about newer versions, though.

About the 401 code: no state is kept (cookies are not used for HTTP authentication), so the username and password must be sent with every request. wget tries the request without the username and password first before resorting to them.
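If the extra 401 round trip is a concern, wget has an `--auth-no-challenge` option that sends Basic credentials with every request instead of waiting for the server's challenge. A minimal sketch, assuming the server uses HTTP Basic authentication; the function name `mirror_portal` is hypothetical, and the credentials and URL are the placeholders from the question.

```shell
# Hypothetical wrapper: mirror the portal, sending Basic credentials
# up front rather than after a 401 challenge.
# Assumes the server uses HTTP Basic authentication.
mirror_portal() {
    wget --mirror --no-parent \
         --auth-no-challenge \
         --http-user=userName --http-password=password \
         --reject 'index.html*,*.tiff,*.bam,*.bai' \
         http://web.server.org/
}
```

Defining the function sends nothing; calling `mirror_portal` starts the mirror. Over plain HTTP the credentials travel unencrypted either way, so this only saves the extra request.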



Answer 4:

wget -X directory_to_exclude[,other_directory_to_exclude] -r ftp://URL_ftp_server

SERVER
    |-logs
    |-etc
    |-cache
    |-public_html
      |-images
      |-videos (want to exclude)
      |-files
      |-audio  (want to exclude)

wget -r -X /public_html/videos,/public_html/audio 'ftp://SERVER/public_html/*'



Tags: wget