I am trying to count the lines in all the files in a very large folder under Ubuntu.
The files are .gz files and I use
zcat * | wc -l
to count all the lines in all the files, and it's slow!
I want to use multi-core computing for this task and found this about GNU Parallel, so I tried this bash command:
parallel zcat * | parallel --pipe wc -l
but not all the cores are working. I found that job startup might cause major overhead, so I tried batching with
parallel -X zcat * | parallel --pipe -X wc -l
without improvement.
How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows (I don't need to keep them uncompressed afterwards)?
Thanks!
Basically the command you are looking for is:

ls *gz | parallel 'zcat {} | wc -l'

What it does is:

ls *gz
lists all gz files on stdout

parallel 'zcat {} | wc -l'
runs the quoted command once for each filename read from stdin
About the '{}', according to the manual: it is the replacement string, and it is replaced by each full line read from the input source (stdin by default). So each line piped to parallel gets fed to zcat.
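If you want to see the substitution concretely, parallel's --dry-run option prints each generated command instead of running it (the filenames below are just hypothetical examples):

ls *gz | parallel --dry-run 'zcat {} | wc -l'

would print something like

zcat data1.gz | wc -l
zcat data2.gz | wc -l

without executing anything.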
Of course this is basic; I assume it could be tuned, and the documentation and examples might help.
If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that by generating the list of filenames without a shell glob, since the glob expansion is what hits the limit.
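As a sketch of that approach (the -maxdepth 1 restriction is my assumption; drop it if you also want subfolders), let find print null-terminated names and have parallel read them with -0:

find . -maxdepth 1 -name '*.gz' -print0 | parallel -0 'zcat {} | wc -l'

The -print0/-0 pair also keeps filenames containing spaces or newlines intact.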
If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename.
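As a sketch (reusing the ls *gz listing from above; switch to the find variant if you have very many files), you can let each job echo its filename in front of the count:

ls *gz | parallel 'echo {} $(zcat {} | wc -l)'

Each output line is then the filename followed by its line count.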
Next, we come to efficiency, and it will depend on what your disks are capable of. Maybe try with parallel -j2, then parallel -j4, and see what works on your system.

As Ole helpfully points out in the comments, you can avoid having to output the name of the file whose lines are being counted by using GNU Parallel's --tag option to tag each output line, so this is even more efficient:
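A sketch of that (--tag prefixes every output line with the corresponding input argument, here the filename):

ls *gz | parallel --tag 'zcat {} | wc -l'

You can still combine this with -j2 or -j4 from above to control how many decompressions run at once.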