How do you split a very large directory, potentially containing millions of files, into smaller directories with a custom-defined maximum number of files each, such as 100 per directory, on UNIX?
Bonus points if you know of a way to have wget download files into these subdirectories automatically. So if there are 1 million .html pages at the top-level path at www.example.com, such as
/1.html
/2.html
...
/1000000.html
and we only want 100 files per directory, it will download them to folders something like
./www.example.com/1-100/1.html
...
./www.example.com/999901-1000000/1000000.html
I only really need to be able to run the UNIX command on the folder after wget has downloaded the files, but if it's possible to do this with wget as it's downloading, I'd love to know!
Another option:
i=1;while read l;do mkdir $i;mv $l $((i++));done< <(ls|xargs -n100)
Or using parallel:
ls|parallel -n100 mkdir {#}\;mv {} {#}
-n100 takes 100 arguments at a time and {#} is the sequence number of the job.
To make the ls | parallel approach more practical, put the destination directory in a variable:
DST=../brokenup; ls | parallel -n100 mkdir -p $DST/{#}\;cp {} $DST/{#}
Note: cd <src_large_dir> before executing.
The DST directory defined above will then contain a copy of the current directory's files, with a maximum of 100 per subdirectory.
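The one-liners above parse the output of ls, so they will misbehave on file names containing spaces or newlines. Here is a minimal sketch of the same idea that avoids that, assuming bash and a find/sort that support -maxdepth, -print0 and -z (GNU and BSD versions do). The function name split_dir and its arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# split_dir SRC [PER_DIR]: move the regular files directly inside SRC into
# SRC/1, SRC/2, ... with at most PER_DIR files in each (default 100).
# Hypothetical helper; the name and argument order are assumptions.
split_dir() {
    local src=$1 per_dir=${2:-100} count=0 bucket=0 f
    # NUL-delimited names survive spaces and newlines. Piping through sort -z
    # also makes find finish listing before the first file is moved.
    while IFS= read -r -d '' f; do
        if (( count % per_dir == 0 )); then
            bucket=$(( bucket + 1 ))
            mkdir -p "$src/$bucket"
        fi
        mv -- "$f" "$src/$bucket/"
        count=$(( count + 1 ))
    done < <(find "$src" -maxdepth 1 -type f -print0 | sort -z)
}
```

Usage would be something like split_dir ./www.example.com 100. This still runs one mv per file, so it trades speed for safety with odd file names.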
You can run this through a couple of loops, which should do the trick (at least for the numeric part of the file name). I think that doing this as a one-liner is over-optimistic.
#!/bin/bash
# Sorts 1.html .. 10000.html into directories 1-100, 101-200, ..., 9901-10000.
# Raise the upper bound of the outer loop to 9999 for a million files.
for hundreds in {0..99}
do
    min=$((hundreds * 100 + 1))
    max=$((hundreds * 100 + 100))
    current_dir="$min-$max"
    mkdir "$current_dir"
    for ones_tens in {1..100}
    do
        current_file="$((hundreds * 100 + ones_tens)).html"
        #touch "$current_file"
        mv "$current_file" "$current_dir"
    done
done
I did performance testing by first commenting out the mkdir and mv lines and uncommenting the touch line. This created 10,000 files (one-hundredth of your target of 1,000,000 files). Once the files were created, I reverted to the script as written:
$ time bash /tmp/test.bash 2>&1
real 0m27.700s
user 0m26.426s
sys 0m17.653s
As long as you aren't moving files across file systems, the time for each mv command should be constant, so per-file performance should stay similar or better at larger scale. Scaling this up to a million files would take around 2770 seconds (100 × 27.7 s), i.e. about 46 minutes. There are several avenues for optimization, such as moving all files for a given directory in one command, or removing the inner for loop.
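As a rough sketch of the first optimization, the function below collects the files for each range in a bash array and moves them with a single mv per directory, so a million files need about 10,000 mv processes instead of a million. The name bucket_by_range and its parameters are assumptions for illustration, and I have not timed it.

```shell
#!/usr/bin/env bash
# bucket_by_range DIR TOTAL [PER_DIR]: move DIR/1.html .. DIR/TOTAL.html into
# range-named subdirectories (1-100, 101-200, ...), one mv per directory.
# Hypothetical helper; the name and argument order are assumptions.
bucket_by_range() {
    local dir=$1 total=$2 per_dir=${3:-100} min max n files
    for (( min = 1; min <= total; min += per_dir )); do
        max=$(( min + per_dir - 1 ))
        (( max > total )) && max=$total
        files=()
        for (( n = min; n <= max; n++ )); do
            # Skip numbers whose file is missing (e.g. a failed download).
            [[ -e $dir/$n.html ]] && files+=("$dir/$n.html")
        done
        # Nothing to move for this range: do not create an empty directory.
        (( ${#files[@]} )) || continue
        mkdir -p "$dir/$min-$max"
        mv -- "${files[@]}" "$dir/$min-$max/"
    done
}
```

For example, bucket_by_range ./www.example.com 1000000 100. The inner loop remains, but it only runs shell built-ins, so the cost of starting a process per file goes away.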
Using wget to grab a million files is going to take far longer than this, and is almost certainly going to require some optimization; saving bandwidth on HTTP headers alone will cut run time by hours. A shell script is probably not the right tool for that job; a library such as WWW::Curl on CPAN will be much easier to optimize.
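That said, for the bonus part of the question, here is a minimal sketch of downloading straight into range-named directories from the shell. As far as I know wget has no built-in option for bucketing, so this runs one wget per bucket and uses -P (--directory-prefix) to choose where the files land. The helpers range_dir and fetch_bucketed are hypothetical names, the URL layout (BASE/N.html) is taken from the question, and I have not run this against a real server.

```shell
#!/usr/bin/env bash
# range_dir N [PER_DIR]: print the "min-max" directory name that file number N
# belongs to, e.g. 1-100 for N=42 with the default PER_DIR of 100.
range_dir() {
    local n=$1 per_dir=${2:-100} min
    min=$(( (n - 1) / per_dir * per_dir + 1 ))
    printf '%s-%s\n' "$min" "$(( min + per_dir - 1 ))"
}

# fetch_bucketed BASE_URL TOTAL [PER_DIR]: hypothetical helper that downloads
# BASE_URL/1.html .. BASE_URL/TOTAL.html, one wget invocation per bucket.
fetch_bucketed() {
    local base=$1 total=$2 per_dir=${3:-100} min max n dest urls
    for (( min = 1; min <= total; min += per_dir )); do
        max=$(( min + per_dir - 1 ))
        (( max > total )) && max=$total
        urls=()
        for (( n = min; n <= max; n++ )); do
            urls+=("$base/$n.html")
        done
        # Strip the scheme so the tree starts at ./www.example.com/...
        dest="./${base#*://}/$(range_dir "$min" "$per_dir")"
        mkdir -p "$dest"
        # -q silences progress output; -P sets the directory files are saved in.
        wget -q -P "$dest" "${urls[@]}"
    done
}
```

Calling fetch_bucketed http://www.example.com 1000000 should then save to ./www.example.com/1-100/1.html and so on, matching the layout in the question.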