I'd like to mirror a simple password-protected web portal that holds some data, and keep the mirror up to date. Essentially, this website is just a directory listing with data organised into folders, and I don't really care about keeping HTML files and other formatting elements.
However, some file types are far too large to download, so I want to ignore these.
Using wget -m with the -R/--reject flag nearly does what I want, except that every file gets downloaded first and is only deleted afterwards if it matches the -R pattern.
Here's how I'm using wget:
wget --http-user userName --http-password password -R index.html,*tiff,*bam,*bai -m http://web.server.org/
Which produces output like this, confirming that an excluded file (index.html) (a) gets downloaded, and (b) then gets deleted:
...
--2012-05-23 09:38:38-- http://web.server.org/folder/
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 401 Authorization Required
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2677 (2.6K) [text/html]
Saving to: `web.server.org/folder/index.html'
100%[======================================================================================================================>] 2,677 --.-K/s in 0s
Last-modified header missing -- time-stamps turned off.
2012-05-23 09:38:39 (328 MB/s) - `web.server.org/folder/index.html' saved [2677/2677]
Removing web.server.org/folder/index.html since it should be rejected.
...
Is there a way to force wget to reject the file before downloading it?
Is there an alternative that I should consider?
Also, why do I get a 401 Authorization Required error for every downloaded file, despite supplying a username and password? It's as if wget tries to connect unauthenticated every time before trying the username/password.
Thanks, Mark
Pavuk (http://www.pavuk.org) looked like a promising alternative: it can mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 segfaults or dies randomly in the middle of long transfers and does not appear to be actively developed (this version was built in Nov 2008).
FYI, here's how I was using it:
pavuk -mode mirror -force_reget -preserve_time -progress -Robots -auth_scheme 3 -auth_name x -auth_passwd x -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old -cdir /path/to/root -subdir /path/to/root -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' http://web.server.org/folder 2>&1 | tee pavuk-
date.log
In the end, wget --exclude-directories did the trick:
wget --mirror --continue --progress=dot:mega --no-parent \
--no-host-directories --cut-dirs=1 \
--http-user x --http-password x \
--exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
--directory-prefix /path/to/local/mirror \
http://my.server.org/folder
Since the --exclude-directories wildcards don't span '/', you need to write your patterns quite specifically to avoid downloading entire folders.
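As a rough sketch of what that means in practice: because a single '*' only covers one path component, you need one pattern per directory depth at which the large-data folder might appear. The snippet below only builds such a comma-separated list for --exclude-directories; the base folder name and the depth range of 1 to 3 are assumptions for illustration, not something taken from the real server.

```shell
# Build one exclude pattern per directory depth, because '*' does not cross '/'.
# 'folder' and the depth range 1..3 are illustrative assumptions.
base='folder'
name='folder_containing_large_data*'
patterns=''
prefix="$base"
for depth in 1 2 3; do
  prefix="$prefix/*"
  # Append with a comma separator, except before the first entry.
  patterns="${patterns:+$patterns,}$prefix/$name"
done

# Quote the variable so the shell does not expand the wildcards itself.
echo "$patterns"
```

The resulting string can then be passed as --exclude-directories="$patterns", keeping the double quotes.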
Mark
The --reject 'pattern' parameter actually worked for me with wget 1.14.
For example:
wget --reject rpm http://somerpmmirror.org/site/
None of the *.rpm files were downloaded at all, only the indexes.
Warning: file patterns can be unintentionally expanded by bash if they match a file located in the working directory. Use quotes to avoid that:
touch blahblah.rpm
# working
wget -R '*.rpm' ....
# working
wget -R "*.rpm" ....
# not working
wget -R *.rpm ....
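To see what wget actually receives in each case, here is a small self-contained demonstration that never touches the network. show_args is a hypothetical stand-in for wget that just prints its arguments, and the scratch directory is only there so the demo does not touch real files.

```shell
# Scratch directory so the demo does not interfere with real files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch blahblah.rpm

# show_args is a hypothetical stand-in for wget: it prints the arguments it was given.
show_args() { printf '%s\n' "$*"; }

unquoted=$(show_args -R *.rpm)    # the shell expands the glob before the command runs
quoted=$(show_args -R '*.rpm')    # the pattern is passed through literally

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

In the unquoted case the command sees the file name blahblah.rpm instead of the pattern, so only that exact name would be rejected.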
Not possible with wget: http://linuxgazette.net/160/misc/lg/how_to_make_wget_exclude_a_particular_link_when_mirroring.html
Well, I am not sure about newer versions, though.
About the 401 code: HTTP authentication keeps no state (no cookie is used), so the username and password must be sent with every request. wget tries each request without the username and password first, and only sends them after the server answers with 401.
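If the extra round trip bothers you, wget has an --auth-no-challenge option that sends the Basic credentials up front with every request. A minimal sketch, assuming the server uses Basic authentication and reusing the placeholder credentials and URL from the question. Note that the wget manual does not recommend this option in general, because the password is sent without the server asking for it.

```shell
# Placeholder user, password and URL copied from the question.
set -- --mirror --auth-no-challenge \
  --http-user userName --http-password password \
  -R 'index.html,*tiff,*bam,*bai' \
  http://web.server.org/

# Dry run: print the command for review instead of contacting the server.
preview="wget $*"
echo "$preview"

# To run it for real, uncomment the next line:
# wget "$@"
```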
wget -X directory_to_exclude[,other_directory_to_exclude] -r ftp://URL_ftp_server
SERVER
|-logs
|-etc
|-cache
|-public_html
  |-images
  |-videos (want to exclude)
  |-files
  |-audio (want to exclude)
wget -X /public_html/videos,/public_html/audio 'ftp://SERVER/public_html/*'