Question:
I'm trying to use wget command:
wget -p http://www.example.com
to fetch all the files on the main page. For some websites it works, but in most cases it only downloads the index.html. I've tried the wget -r command but it doesn't work. Does anyone know how to fetch all the files on a page, or just give me a list of files and the corresponding URLs on the page?
Answer 1:
Wget is able to download an entire website, but because this can put a heavy load on the server, wget will obey the robots.txt file.
wget -r -p http://www.example.com
The -p parameter tells wget to include all files, including images. This means that all of the HTML files will look as they should.
So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command, like this:
wget -r -p -e robots=off http://www.example.com
As many sites will not let you download the entire site, they will check your browser's identity. To get around this, use -U mozilla to identify yourself as a browser, like this:
wget -r -p -e robots=off -U mozilla http://www.example.com
A lot of website owners will not like the fact that you are downloading their entire site. If the server sees that you are downloading a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds after every download. The way to do this with wget is by including --wait=X (where X is the number of seconds).
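For example, a minimal sketch that pauses five seconds between downloads (the value is arbitrary, and the URL is a placeholder):
wget --wait=5 -r -p -e robots=off -U mozilla http://www.example.com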
You can also use the --random-wait parameter to let wget choose a random number of seconds to wait. To include this in the command:
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
Answer 2:
Firstly, to clarify the question, the aim is to download index.html plus all the requisite parts of that page (images, etc). The -p option is equivalent to --page-requisites.
The reason the page requisites are not always downloaded is that they are often hosted on a different domain from the original page (a CDN, for example). By default, wget refuses to visit other hosts, so you need to enable host spanning with the --span-hosts option.
wget --page-requisites --span-hosts 'http://www.amazon.com/'
If you need to be able to load index.html and have all the page requisites load from the local version, you'll need to add the --convert-links option, so that URLs in img src attributes (for example) are rewritten to relative URLs pointing to the local versions.
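For example, a sketch combining the three options discussed so far (the Amazon URL is just the example used above):
wget --page-requisites --span-hosts --convert-links 'http://www.amazon.com/'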
Optionally, you might also want to save all the files under a single "host" directory by adding the --no-host-directories option, or save all the files in a single, flat directory by adding the --no-directories option. Using --no-directories will result in lots of files being downloaded to the current directory, so you probably want to specify a folder name for the output files using --directory-prefix.
wget --page-requisites --span-hosts --convert-links --no-directories --directory-prefix=output 'http://www.amazon.com/'
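Alternatively, a sketch using --no-host-directories instead, which keeps wget's usual directory layout but drops the per-host top-level directories:
wget --page-requisites --span-hosts --convert-links --no-host-directories --directory-prefix=output 'http://www.amazon.com/'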
Answer 3:
The link you have provided is the homepage, i.e. /index.html, so it's clear that you are getting only the index.html page. For an actual download, for example of a "test.zip" file, you need to add the exact file name at the end. For example, use the following command to download the test.zip file:
wget -p domainname.com/test.zip
Download a Full Website Using wget --mirror
The following is the command you want to execute when you want to download a full website and make it available for local viewing.
wget --mirror -p --convert-links -P ./LOCAL-DIR http://www.example.com
--mirror: turn on options suitable for mirroring.
-p: download all files that are necessary to properly display a given HTML page.
--convert-links: after the download, convert the links in the document for local viewing.
-P ./LOCAL-DIR: save all the files and directories to the specified directory.
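Once the mirror finishes, the local copy can be opened in a browser; for example, on a Linux desktop (assuming the default per-host directory layout under ./LOCAL-DIR):
xdg-open ./LOCAL-DIR/www.example.com/index.html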
Download Only Certain File Types Using wget -r -A
You can use this in the following situations:
- Download all images from a website
- Download all videos from a website
- Download all PDF files from a website
wget -r -A.pdf http://example.com/
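The same -A pattern works for the other file types listed above; for example, a sketch that grabs common image formats from a site (the URL is a placeholder):
wget -r -A jpg,jpeg,png,gif http://example.com/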
Answer 4:
Another problem might be that the site you're mirroring uses links without www. So if you specify
wget -p -r http://www.example.com
it won't download any linked (internal) pages because they are from a "different" domain. If this is the case, then use
wget -p -r http://example.com
instead (without www).
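Alternatively (a sketch, assuming both hostnames serve the same content), you can keep the www URL and explicitly allow both domains by combining host spanning with a domain list:
wget -p -r -H -D example.com,www.example.com http://www.example.com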
Answer 5:
I know that this thread is old, but try what is mentioned by Ritesh with:
--no-cookies
It worked for me!
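For reference, a sketch combining --no-cookies with the flags from Answer 1 (the URL is a placeholder):
wget --no-cookies -r -p -e robots=off -U mozilla http://www.example.com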
Answer 6:
If you look for index.html in the wget manual, you can find an option --default-page=name, which is index.html by default. You can change it to index.php, for example.
--default-page=index.php
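For example, a sketch of a full invocation (the URL is a placeholder):
wget -r -p --default-page=index.php http://www.example.com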
Answer 7:
If you only get the index.html and that file looks like it only contains binary data (i.e. no readable text, only control characters), then the site is probably sending the data using gzip compression. You can confirm this by running cat index.html | gunzip to see if it outputs readable HTML.
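As an additional check (assuming curl is installed), you can inspect the response headers directly to see whether the server declares gzip compression:
curl -sI http://www.example.com | grep -i '^content-encoding'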
If this is the case, then wget's recursive feature (-r) won't work. There is a patch for wget to work with gzip-compressed data, but it doesn't seem to be in the standard release yet.
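As a workaround for the single downloaded file, you can decompress it by hand (a sketch; the output file name is arbitrary):
gunzip -c < index.html > index.decoded.html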