Ubuntu terminal - using GNU Parallel to read lines

Posted 2019-06-27 23:09

I am trying to count the lines in all the files in a very large folder under Ubuntu.

The files are .gz files and I use

zcat * | wc -l

to count all the lines in all the files, and it's slow!

I want to use multiple cores for this task and came across GNU Parallel.

I tried this bash command:

parallel zcat * | parallel --pipe wc -l

but not all the cores were busy. I suspected that starting each job was causing major overhead, so I tried batching with

parallel -X zcat * | parallel --pipe -X wc -l

without improvement.

How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows? (I don't need to keep them uncompressed afterwards.)

Thanks!

2 answers
Luminary・发光体
#2 · 2019-06-27 23:30

Basically the command you are looking for is:

ls *gz | parallel 'zcat {} | wc -l'

What it does is:

  • ls *gz lists all gz files on stdout
  • Pipe it to parallel
  • Spawn subshells with parallel
  • Run in said subshells the command inside quotes 'zcat {} | wc -l'

About the '{}', according to the manual:

This replacement string will be replaced by a full line read from the input source

So each line piped to parallel gets fed to zcat.

Of course this is basic and I assume it could be tuned; the documentation and examples might help.

贪生不怕死
#3 · 2019-06-27 23:35

If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that like this:

find . -maxdepth 1 -name \*gz -print0 | parallel -0 ...

If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename:

find ... | parallel -0 'echo {} $(zcat {} | wc -l)'

Next comes efficiency, which will depend on what your disks are capable of. Try parallel -j2, then parallel -j4, and see what works best on your system.


As Ole helpfully points out in the comments, you can avoid having to output the name of the file whose lines are being counted by using GNU Parallel's --tag option to tag each output line, so this is even more efficient:

find ... | parallel -0 --tag 'zcat {} | wc -l'