Download images from website

2019-01-17 14:03发布

问题:

I want to have a local copy of a gallery on a website. The gallery shows the pictures at domain.com/id/1 (id increases in increments of 1) and then the image is stored at pics.domain.com/pics/original/image.format. The exact line that the images have in the HTML are

<div id="bigwall" class="right"> 
    <img border=0 src='http://pics.domain.com/pics/original/image.jpg' name='pic' alt='' style='top: 0px; left: 0px; margin-top: 50px; height: 85%;'> 
</div>

So I want to write a script that does something like this (in pseudo-code):

for(id = 1; id <= 151468; id++) {
     page = "http://domain.com/id/" + id.toString();
     src = returnSrc(); // Searches the html for img with name='pic' and saves the image location as a string
     getImg(); // Downloads the file named in src
}

I'm not sure exactly how to do this though. I suppose I could do it in bash, using wget to download the html and then search the html manually for http://pics.domain.com/pics/original/. then use wget again to save the file, remove the html file, increment the id and repeat. The only thing is I'm not good at handling strings, so if anyone could tell me how to search for the url and replace the *s with the file name and format I should be able to get the rest going. Or if my method is stupid and you have a better one please share.

# get all pages
curl 'http://domain.com/id/[1-151468]' -o '#1.html'

# get all images
grep -oh 'http://pics.domain.com/pics/original/.*jpg' *.html >urls.txt

# download all images
sort -u urls.txt | wget -i-