Parallel WGET download in bash script

Posted 2019-06-12 16:06

I have this small script to download images from a given list in a file.

FILE=./img-url.txt
while read line; do
url=$line
wget -N -P /images/ $url
wget -N  -P /images/ ${url%.jpg}_{001..005}.jpg
done < $FILE

The problem is that it runs too long (the file has more than 5000 lines). Is there any way to speed things up, such as splitting the source txt into separate files and running multiple wget instances at the same time?

Tags: bash wget
1 answer
做个烂人
#2 · 2019-06-12 17:01

There are a number of ways to go about this. GNU Parallel would be the most general solution, but given how you posed your question: yes, split the file into parts and run the script on each part simultaneously. How many pieces to split the file into is an interesting question. 100 pieces would mean spawning 100 wget processes at once, and almost all of them would sit idle while a few use up the network bandwidth. For all I know a single process might saturate the bandwidth on its own, but my guess is that a good compromise is four pieces, so that 4 wget processes run simultaneously. I'm going to call your script geturls.sh. Type this at the command line:

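split -n l/4 img-url.txt    # GNU split: 4 pieces, lines kept whole (-l 4 would mean 4 lines per file)
for f in xaa xab xac xad; do
    ./geturls.sh "$f" &
done
wait    # return only after all four background jobs have finished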

This splits your file into four roughly even pieces. By default split gives its output files bland names, in this case xaa, xab, xac and xad. The for loop passes each of those names to geturls.sh as a command line argument, the first thing on the command line after the program name. Each geturls.sh is put into the background (&), so the next iteration of the loop can start immediately. In this way geturls.sh runs on all four pieces of the file virtually simultaneously, and you have 4 wget processes going at the same time.
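The split-and-background pattern described above can be wrapped in a small helper. This is only a sketch: run_in_pieces is a hypothetical name, the pieces go into a temporary directory instead of the current one, and the -n l/N chunk option is assumed to be available (it is a GNU coreutils feature, so other split implementations may lack it).

```shell
#!/bin/bash
# run_in_pieces LIST N WORKER [ARGS...]   (hypothetical helper, not part of the original answer)
# Splits LIST into N line-aligned pieces, runs WORKER on every piece in the
# background, and returns once all of the background jobs have finished.
run_in_pieces() {
    local list=$1 n=$2
    shift 2
    local dir piece
    dir=$(mktemp -d) || return 1
    # GNU split: "l/N" means N chunks, never cutting a line in half
    split -n "l/$n" "$list" "$dir/part."
    for piece in "$dir"/part.*; do
        "$@" "$piece" &      # one background job per piece
    done
    wait                     # block until every job is done
    rm -rf "$dir"
}
```

With the script from this answer it would be called as run_in_pieces img-url.txt 4 ./geturls.sh. The wait matters when something else follows the downloads, since otherwise the shell carries on while wget is still running.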

The contents of geturls.sh are

#!/bin/bash
FILE=$1
while IFS= read -r line; do
    url=$line
    wget -N -P /images/ "$url"
    wget -N -P /images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"

The changes I made to your code are the explicit declaration of the shell (out of habit mostly), the assignment of FILE from the $1 variable, and some defensive quoting (plus read -r) so that whitespace, backslashes, or glob characters such as ? and * in a URL are not reinterpreted by the shell. Recall that $1 is the (first) command line argument, which is here the name of one of the pieces of your img-url.txt file.
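If you would rather not split the file at all, a worker pool gives a similar effect. The sketch below uses xargs -P, not GNU Parallel itself; -P is an extension found in GNU and BSD xargs rather than a POSIX feature. The function names expand_urls and download_all are made up for illustration, and the wget flags and the /images/ directory are the ones from your script.

```shell
#!/bin/bash
# expand_urls: read URLs on stdin and print each one followed by its
# _001 .. _005 variants, one URL per line (hypothetical helper).
expand_urls() {
    local url n
    while IFS= read -r url; do
        [ -n "$url" ] || continue            # skip empty lines
        printf '%s\n' "$url"
        for n in 001 002 003 004 005; do
            printf '%s\n' "${url%.jpg}_${n}.jpg"
        done
    done
}

# download_all LIST [JOBS]: feed the expanded list to at most JOBS
# (default 4) wget processes at a time, one URL per wget call.
download_all() {
    expand_urls < "$1" | xargs -n 1 -P "${2:-4}" wget -N -P /images/
}
```

It would be started with download_all img-url.txt 4. Compared with splitting into fixed pieces, a worker that finishes early simply picks up the next URL, so one slow piece cannot hold up the rest. The cost is one wget process per URL, which gives up any connection reuse between the six requests for one image.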
