I am trying to count the lines in all the files in a very large folder under Ubuntu.
The files are .gz files and I use
zcat * | wc -l
to count all the lines in all the files, and it's slow!
I want to use multi-core computing for this task and found this about GNU Parallel, so I tried this bash command:
parallel zcat * | parallel --pipe wc -l
but not all the cores are working. I found that job startup might cause major overhead, so I tried batching with
parallel -X zcat * | parallel --pipe -X wc -l
without improvement.
How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows (I don't need to keep them uncompressed afterwards)?
Thanks!
Basically the command you are looking for is:

ls *gz | parallel 'zcat {} | wc -l'

What it does is:

ls *gz
lists all gz files on stdout

parallel 'zcat {} | wc -l'
runs the quoted command once for each filename read from stdin
About the '{}', according to the manual: it is the replacement string, and it is replaced by each full line read from the input source (stdin by default). So each line piped to parallel gets fed to zcat.
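If you want to see the substitution concretely, parallel's --dry-run option prints each generated command instead of running it (the filenames below are just hypothetical examples):

ls *gz | parallel --dry-run 'zcat {} | wc -l'

would print something like

zcat data1.gz | wc -l
zcat data2.gz | wc -l

without executing anything.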
Of course this is basic; I assume it could be tuned, and the documentation and examples might help.
If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that by generating the list of filenames without a shell glob, since the glob expansion is what hits the limit.
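As a sketch of that approach (the -maxdepth 1 restriction is my assumption; drop it if you also want subfolders), let find print null-terminated names and have parallel read them with -0:

find . -maxdepth 1 -name '*.gz' -print0 | parallel -0 'zcat {} | wc -l'

The -print0/-0 pair also keeps filenames containing spaces or newlines intact.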
If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename.
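As a sketch (reusing the ls *gz listing from above; switch to the find variant if you have very many files), you can let each job echo its filename in front of the count:

ls *gz | parallel 'echo {} $(zcat {} | wc -l)'

Each output line is then the filename followed by its line count.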
Next, we come to efficiency, and it will depend on what your disks are capable of. Maybe try with parallel -j2, then parallel -j4, and see what works on your system.

As Ole helpfully points out in the comments, you can avoid having to output the name of the file whose lines are being counted by using GNU Parallel's --tag option to tag each output line, so this is even more efficient:
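A sketch of that (--tag prefixes every output line with the corresponding input argument, here the filename):

ls *gz | parallel --tag 'zcat {} | wc -l'

You can still combine this with -j2 or -j4 from above to control how many decompressions run at once.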