wget loop where numbers in URL stay the same

Posted 2019-03-04 09:55

Question:

I would like to download a bunch of PDFs with wget in bash (version 3.2.57(1)-release) on a Mac. The PDFs are old newspaper articles, which were published almost every day between 1810 and 1816.

I tried the following command:

for i in {10..16}; do wget -A pdf -nc -E -nd --no-check-certificate http://digital.slub-dresden.de/fileadmin/data/453041671-18$i0{1..9}0{1..9}/453041671-18$i0{1..9}0{1..9}_tif/jpegs/453041671-18$i0{1..9}0{1..9}.pdf http://digital.slub-dresden.de/fileadmin/data/453041671-18$i{10..12}{10..31}/453041671-18$i{10..12}{10..31}_tif/jpegs/453041671-18$i{10..12}{10..31}.pdf; done

The unfortunate thing is that the URL contains several numbers I need to iterate over, which makes the expanded argument list grow until it exceeds the maximum length, e.g.

453041671-18$i0{1..9}0{1..9}/453041671-18$i0{1..9}0{1..9}_tif/jpegs/453041671-18$i0{1..9}0{1..9}.pdf

and I receive an "Argument list too long" error message.
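The blow-up is easy to see on a small scale: each independent brace group multiplies the number of generated words, so repeating the same month/day groups several times in one URL quickly produces an argument list that exceeds the kernel's limit. A minimal bash illustration (numbers chosen here just for demonstration):

```shell
# One group of three values yields 3 words:
echo x{1..3} | wc -w

# Two independent groups yield 3 * 3 = 9 words -- the counts multiply:
echo x{1..3}{1..3} | wc -w
```

With three or four such groups per URL, as in the command above, the expansion reaches thousands of arguments before wget is even started.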

If you take the above link snippet as an example the only existing link would be:

453041671-18000701/453041671-18000701_tif/jpegs/453041671-18000701.pdf

where all three occurrences of the date carry the same number (18000701), unlike this example:

453041671-18000801/453041671-18000701_tif/jpegs/453041671-18000701.pdf

or any other combination wget is trying.

How can I tell wget to use, in each iteration over the months {1..9} and {10..12}, respectively, the same number in every position?

Answer 1:

Brace expansions don't know about each other: you can't have multiple brace expansions change in tandem. Each group expands independently, producing the full cross product. Instead, you must use a for loop.
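A quick way to see the cross-product behaviour in bash:

```shell
# The two groups expand independently, not in lockstep:
echo {1..2}{1..2}   # prints: 11 12 21 22  (not just 11 and 22)
```

A loop variable, by contrast, holds one value per iteration, so every place it appears gets the same number:

```shell
for d in 1 2; do echo "$d$d"; done   # prints: 11, then 22
```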

for year in {10..16}; do
  for month in $(seq -w 1 12); do
    for day in $(seq -w 1 31); do
      wget ... 453041671-18$year$month$day/453041671-18$year$month${day}_tif/jpegs/453041671-18$year$month$day.pdf
      # The second day is in braces because otherwise it would parse as $day_tif.
    done
  done
done

In case you want to reduce the number of wget processes spawned, you can replace wget with echo ... >> listing, and then use wget's --input-file (-i) option to make it pull the URLs from that file.
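A sketch of that variant, reusing the loop above (the URL pattern is taken from the question; whether every generated date actually exists on the server is not checked here, but wget with -nc will simply skip the 404s):

```shell
# Generate all candidate URLs into a file instead of calling wget per day.
for year in $(seq -w 10 16); do
  for month in $(seq -w 1 12); do
    for day in $(seq -w 1 31); do
      d="453041671-18${year}${month}${day}"
      echo "http://digital.slub-dresden.de/fileadmin/data/${d}/${d}_tif/jpegs/${d}.pdf"
    done
  done
done > url-list.txt

# Then a single wget process fetches everything:
# wget -A pdf -nc -E -nd --no-check-certificate -i url-list.txt
```

This spawns wget once for the whole run instead of once per date, and the file doubles as a record of what was attempted.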