Crawl website using wget and limit total number of

Posted 2019-05-16 04:40

Question:

I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website, and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

Answer 1:

You can't. wget has no option to stop after a given number of links, so if you want something like this, you would have to write a tool yourself.

You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not something that wget supports.
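A minimal Python sketch of that idea, assuming a breadth-first crawl restricted to the starting host. The names `LinkParser`, `fetch`, and `crawl` are made up for illustration, and the sketch neither reads robots.txt nor waits between requests, both of which the wget command in the question does:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, limit=100, fetch=fetch):
    """Breadth-first crawl of one host, stopping once `limit` links are found."""
    host = urlparse(start_url).netloc
    found = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(found) < limit:
        page = queue.popleft()
        try:
            html = fetch(page)
        except OSError:
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            url = urldefrag(urljoin(page, href)).url
            if url in seen or urlparse(url).netloc != host:
                continue  # already recorded, or points off-site
            seen.add(url)
            found.append(url)
            queue.append(url)
            if len(found) >= limit:
                break
    return found


if __name__ == "__main__":
    for link in crawl("http://www.example.com", limit=100):
        print(link)
```

The `fetch` parameter exists only so the crawl logic can be tried against canned pages without touching the network.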

You could also look at HTTrack for website crawling; it has quite a few extra options for this: http://www.httrack.com/



Answer 2:

  1. Create a FIFO (mknod /tmp/httpipe p).
  2. Fork.
    • In the child, run wget --spider -r -l 1 http://myurl --output-file /tmp/httpipe
    • In the parent, read /tmp/httpipe line by line.
    • Match each line against the Perl regex m{^\-\-\d\d:\d\d:\d\d\-\- http://$self->{http_server}:$self->{tcport}/(.*)$} and print $1.
    • Count the lines; after 100 lines, close the file. That breaks the pipe, so wget stops at its next write.
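
The steps above can be sketched in Python. This version swaps the named FIFO and fork for a `subprocess` pipe on wget's standard error, which is where wget writes its log when no `--output-file` is given. The names `REQUEST_LINE`, `first_urls`, and `spider` are made up for illustration, and the regex assumes request lines of the form `--HH:MM:SS--  URL`, optionally with a date before the time:

```python
import re
import subprocess

# wget announces each request on a line such as
#   --2019-05-16 04:40:00--  http://www.example.com/page.html
# (assumed format; older releases print only the time between the dashes).
REQUEST_LINE = re.compile(
    r"^--(?:\d{4}-\d{2}-\d{2} )?\d{2}:\d{2}:\d{2}--\s+(\S+)"
)


def first_urls(lines, limit=100):
    """Yield the URL from each request line, stopping after `limit` URLs."""
    if limit <= 0:
        return
    count = 0
    for line in lines:
        match = REQUEST_LINE.match(line)
        if match:
            yield match.group(1)
            count += 1
            if count >= limit:
                return


def spider(start_url, limit=100):
    """Run wget in spider mode and stop it once `limit` URLs were logged."""
    cmd = ["wget", "--spider", "-r", "-l", "1", start_url]
    with subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        errors="replace",
    ) as proc:
        urls = list(first_urls(proc.stderr, limit))
        proc.terminate()  # stop wget instead of waiting for a broken pipe
    return urls


if __name__ == "__main__":
    for url in spider("http://www.example.com", limit=100):
        print(url)
```

Like the original recipe, this counts log lines rather than distinct pages, so the same URL may be counted more than once if wget requests it more than once.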