I am trying to script a wget command to download a web page and all its attachments, JPEGs, etc.
When I enter the command by hand, it works, but I need to run it over 35,000 times to archive an old web site which is outside of my control (international company politics, but I'm the owner of the data).
My problem has been putting the session parameters into variables.
My script so far is as follows:
cnt=35209
# initialise the headers
general_settings='-4 -P xyz --restrict-file-names=windows -nc --limit-rate=250k'
html_page_specific='--convert-links --html-extension'
proxy='--proxy-user=xxxxxx --proxy-password=yyyyyyy'
session="--header=\'Host: mywebsite.com:9090\' --header=\'User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0\'"
address=http://mywebsite.com:9090/browse/item-$cnt
echo $general_settings $proxy $session $cookie $address
echo
echo
echo Getting item-$cnt...
#while [ $cnt -gt 0 ]
#do
# # get the page
wget --debug $general_settings $html_page_specific $proxy $session $cookie $address
# now get the attachments, pdf, txt, jpg, gif, sql, etc...
# wget -A.pdf $general_settings -r $proxy $session $cookie $address
# wget -A.txt $general_settings -r $proxy $session $cookie $address
# wget -A.jpg $general_settings -r $proxy $session $cookie $address
# wget -A.gif $general_settings -r $proxy $session $cookie $address
# wget -A.sql $general_settings -r $proxy $session $cookie $address
# wget -A.doc $general_settings -r $proxy $session $cookie $address
# wget -A.docx $general_settings -r $proxy $session $cookie $address
# wget -A.xls $general_settings -r $proxy $session $cookie $address
# wget -A.xlsm $general_settings -r $proxy $session $cookie $address
# wget -A.xlsx $general_settings -r $proxy $session $cookie $address
# wget -A.xml $general_settings -r $proxy $session $cookie $address
# wget -A.ppt $general_settings -r $proxy $session $cookie $address
# wget -A.pptx $general_settings -r $proxy $session $cookie $address
# wget -A.png $general_settings -r $proxy $session $cookie $address
# wget -A.ps $general_settings -r $proxy $session $cookie $address
# wget -A.mdb $general_settings -r $proxy $session $cookie $address
# ((cnt=cnt-1))
#
#done
But when I run the script, I get the following output:
Getting item-35209...
Setting --inet4-only (inet4only) to 1
Setting --directory-prefix (dirprefix) to xyz
Setting --restrict-file-names (restrictfilenames) to windows
Setting --no (noclobber) to 1
Setting --limit-rate (limitrate) to 250k
Setting --convert-links (convertlinks) to 1
Setting --html-extension (htmlextension) to 1
Setting --proxy-user (proxyuser) to xxxxx
Setting --proxy-password (proxypassword) to yyyyy
Setting --header (header) to \'Host:
Setting --header (header) to 'Cookie:
DEBUG output created by Wget 1.11.4 Red Hat modified on linux-gnu.
As you can see, the Host and Cookie headers are being cut off at the first space instead of being passed as whole values, so the wget command fails to log in and extract the data.
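The splitting can be reproduced without wget. The following is a minimal sketch, assuming bash and using a shortened User-Agent value for readability (the shortening is an assumption, not the real header). It assigns the variable the same way as the script and then shows what the shell does with an unquoted expansion:

```shell
#!/bin/bash
# Same style of assignment as in the script, with a shortened User-Agent
# (the shortened value is only for illustration).
session="--header=\'Host: mywebsite.com:9090\' --header=\'User-Agent: Mozilla/5.0\'"

# Unquoted expansion: the shell splits the value on whitespace and treats
# the quote characters stored inside the variable as ordinary data.
set -- $session

echo "unquoted expansion yields $# words:"
printf '  <%s>\n' "$@"
```

The value splits into four words, and the first one is `--header=\'Host:`, which matches what wget reports in its debug output. Quotes inside a variable are not re-parsed as shell quoting when the variable is expanded.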
I've been reading the bash man pages, googling, and have tried several related suggestions from SO, but I'm still unable to get the command to execute.
Is anyone out there nice enough to show me the correct way to quote quotes in variables?
Thanks,
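One approach commonly suggested for this kind of problem is to hold the options in bash arrays rather than in strings, so that each array element becomes exactly one argument. The following is a sketch only, assuming the script runs under bash (not plain sh). The `cookie` array is left empty because its value is not shown in the script above, and the `cmd` array name is made up for illustration:

```shell
#!/bin/bash
cnt=35209

# Each array element is passed to wget as exactly one argument,
# including any spaces it contains.
general_settings=(-4 -P xyz --restrict-file-names=windows -nc --limit-rate=250k)
html_page_specific=(--convert-links --html-extension)
proxy=(--proxy-user=xxxxxx --proxy-password=yyyyyyy)
session=(
  "--header=Host: mywebsite.com:9090"
  "--header=User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0"
)
cookie=()   # placeholder: the cookie header is not defined in the original script
address="http://mywebsite.com:9090/browse/item-$cnt"

# Build the full command line as an array (cmd is a hypothetical name).
cmd=(wget --debug "${general_settings[@]}" "${html_page_specific[@]}"
     "${proxy[@]}" "${session[@]}" "${cookie[@]}" "$address")

# Dry run: print one argument per line to check the quoting.
printf '<%s>\n' "${cmd[@]}"

# Uncomment to actually run the download:
# "${cmd[@]}"
```

The dry run makes it easy to confirm that each `--header=...` value arrives as a single argument before pointing the loop at all 35,000 items. The quoted `"${name[@]}"` form is what preserves the element boundaries; an unquoted `${name[@]}` would be split on whitespace again.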