wget downloads only one index.html file instead of

Posted 2019-04-10 17:39

Question:

With Wget I normally receive only one file, index.html. I enter the following command:

wget -e robots=off -r http://www.korpora.org/kant/aa03

which gives back only an index.html file.

The directory aa03 corresponds to volume 3 of Kant's works, so there should be some 560 files (pages) in it. These pages are readable online, but they are not downloaded. Is there any remedy? Thanks.

Answer 1:

Following that link brings us to:

http://korpora.zim.uni-duisburg-essen.de/kant/aa03/

In recursive mode, wget by default follows only links on the same host as the starting URL. Since korpora.zim.uni-duisburg-essen.de is a different host from korpora.org, wget will not follow the links on the index page.
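To make the host mismatch concrete, here is a small POSIX shell sketch that extracts the host from each URL and compares them. It is only an illustration of the idea, not how wget is implemented; the helper names url_host and same_host are made up for this example, and the parsing is deliberately simplistic (it ignores user:password@ prefixes, for instance).

```shell
#!/bin/sh
# Extract the host part of a URL using only POSIX parameter expansion.
# (url_host is a hypothetical helper for this sketch, not part of wget.)
url_host() {
    rest=${1#*://}      # drop the scheme, e.g. "http://"
    rest=${rest%%/*}    # drop the path
    rest=${rest%%:*}    # drop an optional port
    printf '%s\n' "$rest"
}

# Hypothetical helper: succeed only if both URLs name the same host.
same_host() {
    [ "$(url_host "$1")" = "$(url_host "$2")" ]
}

start=http://www.korpora.org/kant/aa03
target=http://korpora.zim.uni-duisburg-essen.de/kant/aa03/

if same_host "$start" "$target"; then
    echo "same host: a plain -r crawl would follow this link"
else
    echo "different host: a plain -r crawl would skip this link"
fi
```

For the two URLs from this thread the script takes the "different host" branch, which matches the behaviour described above.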

To remedy this, use --span-hosts (or -H for short). Be careful: -rH is a VERY dangerous combination, because together the two options can end up crawling the entire Internet, so you'll want to keep the scope very tightly focused. This command should do what you intended:

wget -e robots=off -rH -l inf -np -D korpora.org,korpora.zim.uni-duisburg-essen.de http://korpora.org/kant/aa03/index.html

(-np, or --no-parent, keeps the crawl from climbing above aa03/. -D limits it to only those two domains. -l inf removes the recursion depth limit, which defaults to 5 levels, so the crawl is bounded only by -D and -np.)
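If the short flags are hard to read, the same command can be spelled with GNU Wget's long option names. The sketch below wraps it in a small script; the DRY_RUN variable and the crawl function are inventions of this example (not wget features), and by default the script only prints the command so nothing is downloaded by accident.

```shell
#!/bin/sh
# Long-option spelling of the command above (GNU Wget option names).
DOMAINS=korpora.org,korpora.zim.uni-duisburg-essen.de
START_URL=http://korpora.org/kant/aa03/index.html

# DRY_RUN is made up for this sketch: 1 prints the command, 0 runs it.
DRY_RUN=${DRY_RUN:-1}

crawl() {
    # Build the argument list once so the printed and executed forms agree.
    set -- wget \
        --execute robots=off \
        --recursive \
        --span-hosts \
        --level=inf \
        --no-parent \
        --domains="$DOMAINS" \
        "$START_URL"
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

crawl
```

Running it with DRY_RUN=0 in the environment would start the actual crawl.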