Download all files of a particular type from a website

Posted 2020-06-03 07:51

Question:

The following command did not work:

wget -r -A .pdf home_page_url

It stops with the following message:

....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED

I don't understand why it stops at the starting URL and does not follow the links in the page to search for the given file type.

Is there any other way to recursively download all PDF files from a website?

Answer 1:

It may be blocked by the site's robots.txt. Try adding -e robots=off.
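
As a minimal sketch, the command from the question with that flag added would look like this (home_page_url is still the placeholder from the question):

wget -r -A .pdf -e robots=off home_page_url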

Other possible problems are cookie-based authentication or user-agent rejection by the server. See these examples.
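
For illustration, both workarounds can be bolted onto the same command; the user-agent string here is an arbitrary example, and cookies.txt is a hypothetical file exported from a browser session:

wget -r -A .pdf -e robots=off --user-agent="Mozilla/5.0" --load-cookies cookies.txt home_page_url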

EDIT: The dot in ".pdf" is wrong, according to the wget documentation at sunsite.univie.ac.at.
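
Putting the corrections together, a sketch of the fixed command would be:

wget -r -A pdf -e robots=off home_page_url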



Answer 2:

The following command works for me; it downloads the PDFs and images of a site:

wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/
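
Since the question only asks for PDFs, a hedged variant of the same command restricted to that suffix would be:

wget -A pdf -m -p -E -k -K -np http://site/path/

Here -m mirrors the site recursively, -p fetches page requisites, -E adjusts the extensions of downloaded HTML files, -k converts links for local viewing, -K keeps the originals of converted files, and -np stops wget from ascending to the parent directory.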


Answer 3:

This is certainly because the links in the HTML do not end with /.

Wget will not follow this, as it thinks it is a file (one that does not match your filter):

<a href="link">page</a>

But it will follow this:

<a href="link/">page</a>

You can use the --debug option to see whether this is the actual problem.
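
A simple way to capture that output for inspection, reusing the command from the question, might be:

wget --debug -o wget-debug.log -r -A .pdf home_page_url

and then search wget-debug.log for how each rejected link was handled.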

I don't know of any good solution for this. In my opinion, it is a bug in wget.
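
One crude workaround, assuming a Unix shell and enough disk space, is to mirror the site with no accept list at all and delete everything that is not a PDF afterwards (site.com stands for whatever directory wget creates for the host):

wget -r -e robots=off home_page_url
find site.com -type f ! -name '*.pdf' -delete

This is wasteful, but it sidesteps the accept-list behaviour entirely.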