I am trying to count the lines in all the files in a very large folder under Ubuntu.
The files are .gz files and I use
zcat * | wc -l
to count all the lines in all the files, and it's slow!
I want to use multi-core computing for this task and found this about GNU Parallel.
I tried to use this bash command:
parallel zcat * | parallel --pipe wc -l
but not all the cores are working.
I read that job startup might cause major overhead, so I tried batching with
parallel -X zcat * | parallel --pipe -X wc -l
without improvement.
How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows? (I don't need to keep them uncompressed afterwards.)
Thanks!
If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that like this:
find . -maxdepth 1 -name \*gz -print0 | parallel -0 ...
If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename:
find ... | parallel -0 'echo {} $(zcat {} | wc -l)'
Next, we come to efficiency, which will depend on what your disks are capable of. Maybe try with parallel -j2, then parallel -j4, and see what works on your system.
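One way to compare job counts is to time the same count at a few -j values. The following is only a rough sketch, not a benchmark harness: the function names count_total and time_job_counts are made up for illustration, the job counts 1, 2, 4 and 8 are arbitrary, and it assumes GNU Parallel (not the moreutils parallel) is installed.

```shell
# count_total DIR JOBS (hypothetical helper): print the total number of
# lines in DIR/*.gz, running JOBS zcat processes at a time.
count_total() {
  local dir=$1 jobs=$2
  find "$dir" -maxdepth 1 -name '*.gz' -print0 |
    parallel -0 -j"$jobs" 'zcat {} | wc -l' |
    awk '{ sum += $1 } END { print sum + 0 }'
}

# time_job_counts DIR (hypothetical helper): run count_total with several
# job counts and print the wall-clock seconds each run took.
time_job_counts() {
  local dir=$1 jobs start end total
  for jobs in 1 2 4 8; do
    start=$(date +%s)
    total=$(count_total "$dir" "$jobs")
    end=$(date +%s)
    echo "jobs=$jobs total=$total seconds=$((end - start))"
  done
}
```

Call it as time_job_counts . from inside the folder. Later runs may look faster simply because the files are already in the page cache, so treat the first run with some suspicion.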
As Ole helpfully points out in the comments, you can avoid having to output the name of the file whose lines are being counted by using GNU Parallel's --tag option to tag each output line, so this is even more efficient:
find ... | parallel -0 --tag 'zcat {} | wc -l'
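Since the question asks for one overall count, the tagged per-file output can be summed as well. This is a sketch under a few assumptions: tagged_counts is a made-up function name, the awk step is my addition rather than part of the command above, and it relies on --tag separating the filename and the count with a tab.

```shell
# tagged_counts DIR (hypothetical helper): print "filename<TAB>count" for
# every .gz file directly inside DIR, then a final "TOTAL<TAB>sum" line.
tagged_counts() {
  local dir=$1
  find "$dir" -maxdepth 1 -name '*.gz' -print0 |
    parallel -0 --tag 'zcat {} | wc -l' |
    awk -F '\t' '{ print; sum += $NF } END { print "TOTAL\t" (sum + 0) }'
}
```

The per-file lines come out in whatever order the jobs finish, so only the TOTAL line is guaranteed to be last.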
Basically the command you are looking for is:
ls *gz | parallel 'zcat {} | wc -l'
What it does is:
- ls *gz lists all gz files on stdout
- Pipe it to parallel
- Spawn subshells with parallel
- Run in said subshells the command inside quotes: 'zcat {} | wc -l'
About the '{}', according to the manual:
This replacement string will be replaced by a full line read from the input source
So each line piped to parallel gets fed to zcat.
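To see the substitution without decompressing anything, GNU Parallel's --dry-run option prints the commands it would run instead of running them. A small sketch, where show_replacement and the two file names are made up for illustration:

```shell
# show_replacement (hypothetical helper): feed two example file names to
# parallel and print the command each job would run.
# --dry-run prints commands instead of executing them; -k keeps the
# output in the same order as the input lines.
show_replacement() {
  printf '%s\n' first.gz second.gz |
    parallel -k --dry-run 'zcat {} | wc -l'
}
```

Each input line should show up in place of {} in its own command, one command per line.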
Of course this is basic. I assume it could be tuned; the documentation and examples might help.