I normally compress using tar zcvf and decompress using tar zxvf (using gzip out of habit).
I've recently gotten a quad core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores are unused during compression/decompression.
Is there any way I can utilize the unused cores to make it faster?
You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:
tar cf - paths-to-archive | pigz > archive.tar.gz
By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.
tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
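The question also asks about decompression, so here is a minimal sketch of the reverse pipeline. It assumes hypothetical names (demo_src, demo_out, demo.tar.gz) and falls back to gzip when pigz is not installed. Keep in mind that, according to the pigz documentation, decompression itself runs on a single thread; pigz only uses extra threads for reading, writing, and check calculation, so the gain on extraction is usually much smaller than on compression.

```shell
#!/bin/sh
# Sketch: round trip through pigz, falling back to gzip if pigz is missing.
# demo_src, demo_out and demo.tar.gz are hypothetical names for illustration.
set -e

# Pick the compressor: pigz when available, otherwise plain gzip.
if command -v pigz >/dev/null 2>&1; then
    GZ=pigz
else
    GZ=gzip
fi

workdir=$(mktemp -d)
cd "$workdir"
mkdir demo_src demo_out
printf 'hello\n' > demo_src/a.txt

# Compress: tar writes the archive to stdout, the compressor reads stdin.
tar cf - demo_src | "$GZ" > demo.tar.gz

# Decompress: -d decompresses, -c writes to stdout, tar reads the archive from stdin.
"$GZ" -dc demo.tar.gz | tar xf - -C demo_out

cat demo_out/demo_src/a.txt
```

On a real archive the last two steps reduce to pigz -dc archive.tar.gz | tar xf -.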
You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.
For example use:
tar -c --use-compress-program=pigz -f tar.file dir_to_zip
Common approach
tar has an option for this:
-I, --use-compress-program PROG
filter through PROG (must accept -d)
You can use a multithreaded version of an archiver or compressor utility.
The most popular multithreaded compressors are pigz (instead of gzip) and pbzip2 (instead of bzip2). For instance:
$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive
The compressor must accept -d. If your replacement utility does not support that option, or you need to pass additional parameters, use pipes instead (add parameters as necessary):
$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz
The single-threaded and multithreaded tools produce compatible files. You can compress with the multithreaded version and decompress with the single-threaded version, and vice versa.
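If you want to see this compatibility for yourself, the following sketch compresses with pigz (when installed) and then checks and lists the result with plain gzip and tar's built-in -z. The directory and file names are hypothetical. If pigz is unavailable the script falls back to gzip for compression too, in which case it shows nothing more than an ordinary round trip.

```shell
#!/bin/sh
# Sketch: compress with the multithreaded tool, read back with the
# single-threaded one. All paths are hypothetical and live in a temp dir.
set -e

if command -v pigz >/dev/null 2>&1; then
    COMPRESS=pigz
else
    COMPRESS=gzip   # fallback so the sketch still runs without pigz
fi

tmp=$(mktemp -d)
mkdir "$tmp/data"
printf 'compat check\n' > "$tmp/data/note.txt"

# Compress with the (possibly multithreaded) tool.
tar -cf - -C "$tmp" data | "$COMPRESS" > "$tmp/data.tar.gz"

# Verify the stream with single-threaded gzip.
gzip -t "$tmp/data.tar.gz"
echo "gzip accepts the stream"

# List the archive through tar's built-in gzip support.
listing=$(tar -tzf "$tmp/data.tar.gz")
echo "$listing"
```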
p7zip
To use p7zip as the compressor you need a small wrapper script like the following:
#!/bin/sh
case $1 in
-d) 7za -txz -si -so e;;
*) 7za -txz -si -so a .;;
esac 2>/dev/null
Save it as 7zhelper.sh. Example usage:
$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z
xz
Regarding multithreaded XZ support: if you are running version 5.2.0 or above of XZ Utils, you can use multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").
This is a fragment of the man page for version 5.1.0alpha:
Multithreaded compression and decompression are not implemented yet, so this
option has no effect for now.
However, this will not work for decompressing files that were not themselves compressed with threading enabled. From the man page for version 5.2.2:
Threaded decompression hasn't been implemented yet. It will only work
on files that contain multiple blocks with size information in
block headers. All files compressed in multi-threaded mode meet this
condition, but files compressed in single-threaded mode don't even if
--block-size=size is used.
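A minimal sketch of the XZ_DEFAULTS approach described above, assuming GNU tar with -J support and XZ Utils 5.2.0 or newer on PATH. The directory and archive names are hypothetical, and the script simply skips the work if xz is not installed. -T 0 asks xz to use as many threads as there are cores. Note that the man page quoted above is from 5.2.2; XZ Utils 5.4.0 and later added threaded decompression for such multi-block files.

```shell
#!/bin/sh
# Sketch: multithreaded xz compression through tar, via XZ_DEFAULTS.
# Assumes GNU tar (which runs the external xz program for -J).
set -e

tmp=$(mktemp -d)
mkdir "$tmp/logs"
printf 'line one\nline two\n' > "$tmp/logs/app.log"

if command -v xz >/dev/null 2>&1; then
    # xz reads XZ_DEFAULTS from the environment, so the setting reaches
    # the xz process that tar starts for -J.
    XZ_DEFAULTS="-T 0" tar -cJf "$tmp/logs.tar.xz" -C "$tmp" logs

    # Integrity check of the resulting .xz stream.
    xz -t "$tmp/logs.tar.xz"
    result=ok
else
    result=skipped   # xz not installed; nothing to demonstrate
fi
echo "$result"
```

Setting the variable inline like this limits it to one command; exporting it in your shell profile would apply it to every xz invocation.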
Recompiling with replacement
If you build tar from source, you can configure it with these parameters:
--with-gzip=pigz
--with-bzip2=lbzip2
--with-lzip=plzip
After recompiling tar with these options you can check the output of tar's help:
$ tar --help | grep "lbzip2\|plzip\|pigz"
-j, --bzip2 filter the archive through lbzip2
--lzip filter the archive through plzip
-z, --gzip, --gunzip, --ungzip filter the archive through pigz
You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:
tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/
If you want to have more flexibility with filenames and compression options, you can use:
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec \
tar -P --transform='s@/my/path/@@g' -cf - {} + | \
pigz -9 -p 4 > myarchive.tar.gz
Step 1: find
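find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec
This command looks for the files you want to archive, in this case any *.sql and *.log files under /my/path/. Add as many -o -name "pattern" tests as you want inside the escaped parentheses. The parentheses matter: without them, -o has lower precedence than the implicit "and", so -type f would apply only to the first pattern and -exec only to the last.
-exec runs the next command, tar, on the results of find.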
Step 2: tar
tar -P --transform='s@/my/path/@@g' -cf - {} +
--transform applies a sed-style replacement to member names. Here it strips /my/path/ from each file's path, so the tarball's root becomes the current directory when extracting. Note that you can't use the -C option to change directory instead, as you'd lose the benefit of find: all files in the directory would be included.
-P tells tar to keep absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.
-cf - tells tar to create an archive and write it to standard output, so it can be piped to the compressor; the tarball name is given at the end of the pipeline.
{} + passes every file that find found as arguments to tar.
Step 3: pigz
pigz -9 -p 4
Use as many parameters as you want. In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression.
If you run this on a heavily loaded web server, you probably don't want to use all available cores.
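One way to leave headroom on a busy machine is to derive the thread count from the number of online cores instead of hard-coding it. The sketch below uses half of them, with a floor of one. The halving is an arbitrary choice for illustration, and nproc is assumed to come from GNU coreutils, with getconf as a fallback.

```shell
#!/bin/sh
# Sketch: pick a pigz thread count that leaves some cores free.
set -e

# Number of online cores: nproc (GNU coreutils) or getconf as a fallback.
if command -v nproc >/dev/null 2>&1; then
    cores=$(nproc)
else
    cores=$(getconf _NPROCESSORS_ONLN)
fi

# Use half the cores, but never fewer than one thread.
threads=$(( cores / 2 ))
[ "$threads" -ge 1 ] || threads=1

# Only print the command here; substitute it into the pipeline as needed.
echo "would run: pigz -9 -p $threads"
```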
Step 4: archive name
> myarchive.tar.gz
Finally.
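A variant worth considering, sketched here under the assumption of GNU find and GNU tar: when find -exec ... {} + has more files than fit on one command line, it runs tar several times, and the concatenated output is no longer a single clean archive. Feeding the file list to one tar process through --null -T - avoids that. This sketch also swaps -P and --transform for a cd into the directory, so member names are relative from the start; that is a different technique from the one walked through above. All paths are hypothetical, and gzip is substituted when pigz is not installed.

```shell
#!/bin/sh
# Sketch: select files with find, archive them with a single tar process,
# compress with pigz. Paths are hypothetical and created in a temp dir.
set -e

if command -v pigz >/dev/null 2>&1; then
    GZ="pigz -9 -p 4"
else
    GZ="gzip -9"    # fallback so the sketch still runs without pigz
fi

root=$(mktemp -d)
mkdir -p "$root/my/path/sub"
printf 'select 1;\n' > "$root/my/path/dump.sql"
printf 'started\n'   > "$root/my/path/sub/app.log"
printf 'ignored\n'   > "$root/my/path/readme.txt"

# Run find from inside the directory so member names are relative.
cd "$root/my/path"

# -print0 / --null keep odd file names intact; -T - reads the list from stdin.
# $GZ is left unquoted on purpose so its options are split into words.
find . -type f \( -name '*.sql' -o -name '*.log' \) -print0 |
    tar --null -T - -cf - |
    $GZ > "$root/myarchive.tar.gz"

members=$(tar -tzf "$root/myarchive.tar.gz" | sort)
echo "$members"
```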