Popups block bulk download of pdfs from website wi

Posted 2019-06-12 11:43

I would like to download some free-to-download PDFs (copies of an old newspaper) from this website of the Austrian National Library with wget, using the bash script below:

for year in {14..57}; do
  for month in `seq -w 1 12`; do # -w for leading zero
    for day in `seq -w 1 31`; do
      wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_18$year$month$day.pdf
    done
  done
done

Apart from some newspaper issues not being available, I cannot download any issues, even ones that exist. For example, for the existing issue of June 30, 1814 I get the following error:

http://anno.onb.ac.at/pdfs/ONB_lzg_18140630.pdf
Auflösen des Hostnamens anno.onb.ac.at (anno.onb.ac.at)... 193.170.112.230
Verbindungsaufbau zu anno.onb.ac.at (anno.onb.ac.at)|193.170.112.230|:80 ... verbunden.
HTTP-Anforderung gesendet, auf Antwort wird gewartet ... 404 Not Found
FEHLER 404: Not Found.

However, if you download the corresponding PDF manually (here, see upper-right corner), you have to press "ok" in a pop-up acknowledgement. Once I have done this, I can download that issue via wget without a problem.

How can I tell wget to confirm this acknowledgement (the question you get when you want to download a PDF) from the command line? See the screenshot below. Is there a wget option for that?

[Screenshot of the pop-up acknowledgement]

1 answer
倾城 Initia
#2 · 2019-06-12 12:06

There are two issues in your code.

  1. The lzg newspaper is not available for all dates.
  2. The PDFs are not always generated and cached at the URL you used. You first need to request the other URL to make sure the PDF is generated.

Below is the updated code, which should work:

#!/bin/bash

for year in {14..57}; do
  # Scrape the year overview page for the dates on which an issue exists
  DATES=$(curl -sS "http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=18$year&zoom=33" | gawk 'match($0, /datum=([^&]+)/, ary) {print ary[1]}' | xargs echo)

  for date in $DATES
  do
      echo "Downloading for $date"

      # Request the PDF script first so that the server generates and caches the PDF
      curl "http://anno.onb.ac.at/cgi-content/anno_pdf.pl?aid=lzg&datum=$date" \
        -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' \
        -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36' \
        -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
        -H 'DNT: 1' -H "Referer: http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=$date" \
        -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.9' --compressed

      # Then download the cached PDF from the static URL
      wget -A pdf -nc -E -nd --no-check-certificate --content-disposition "http://anno.onb.ac.at/pdfs/ONB_lzg_$date.pdf"
  done
done
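
The date-scraping step relies on gawk's three-argument match(), which other awk implementations do not provide. Where gawk is not available, a rough equivalent can be sketched with grep, sed and sort. The helper names extract_dates and pdf_url are made up for this illustration, and the sketch assumes that every issue link on the overview page carries an eight-digit datum=YYYYMMDD value, which I have not verified against the live site:

```shell
#!/bin/bash

# Hypothetical helper: read HTML on stdin and print each distinct
# eight-digit datum= value (assumed to be YYYYMMDD), one per line.
extract_dates() {
  grep -oE 'datum=[0-9]{8}' | sed 's/^datum=//' | sort -u
}

# Hypothetical helper: build the static PDF URL for one issue date,
# following the URL pattern used in the question.
pdf_url() {
  printf 'http://anno.onb.ac.at/pdfs/ONB_lzg_%s.pdf\n' "$1"
}
```

In the loop above, the gawk pipeline could then be replaced by piping the same curl -sS call into extract_dates. Unlike the gawk version, this also drops duplicate dates and ignores shorter datum= values such as a bare year.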