Compression library using Nvidia's CUDA [close

2020-05-11 11:16发布

Does anyone know a project which implements standard compression methods (like Zip, GZip, BZip2, LZMA,...) using NVIDIA's CUDA library?

I was wondering if algorithms which can make use of a lot of parallel tasks (like compression) wouldn't run much faster on a graphics card than with a dual or quadcore CPU.

What do you think about the pros and cons of such an approach?

6条回答
何必那么认真
2楼-- · 2020-05-11 11:45

30% is nice, but for applications like backups it's not enough by a long shot.

My experience is that the average data stream in such instances gets 1.2-1.7:1 compression using gzip and ends up limited to an output rate of 30-60Mb/s (this is across a wide range of modern (circa 2010-2012) medium-high-end CPUs.

The limitation here is usually the speed at which data can be fed into the CPU itself.

Unfortunately, in order to keep a LTO5 tape drive happy, it needs a raw (uncompressable) data rate of around 160Mb/s. If fed compressable data it requires even faster data rates.

LTO compression is clearly a lot faster, but somewhat inefficient (equivalent to gzip -1 - it's good enough for most purposes). LTO4 drives and upwards usually have built in AES-256 encryption engines which can also maintain these kinds of speeds.

What this means for my case is that I'd need a 400% or better impreovement in order to consider it worthwhile.

Similar considerations apply across LANs. At 30Mb/s, compression is a hinderance on Gb-class networks and the question is whether to spend more on networking or on compression... :)

查看更多
成全新的幸福
3楼-- · 2020-05-11 11:53

Not aware of anyone having done that and made it public. Just IMHO, it doesn't sound very promising.

As Martinus points out, some compression algorithms are highly serial. Block compression algorithms like LZW can be parallelized by coding each block independently. Ziping a large tree of files can be parallelized at the file level.

However, none of these is really SIMD-style parallelism (Single Instruction Multiple Data), and they're not massively parallel.

GPUs are basically vector processors, where you can be doing hundreds or thousands of ADD instructions all in lock step, and executing programs where there are very few data-dependent branches.

Compression algorithms in general sound more like an SPMD (Single Program Multiple Data) or MIMD (Multiple Instruction Multiple Data) programming model, which is better suited to multicore cpus.

Video compression algorithms can be accellerated by GPGPU processing like CUDA only to the extent that there is a very large number of pixel blocks that are being cosine-transformed or convolved (for motion detection) in parallel, and the IDCT or convolution subroutines can be expressed with branchless code.

GPUs also like algorithms that have high numeric intensity (the ratio of math operations to memory accesses.) Algorithms with low numeric intensity (like adding two vectors) can be massively parallel and SIMD, but still run slower on the gpu than the cpu because they're memory bound.

查看更多
Animai°情兽
4楼-- · 2020-05-11 11:56

Encryption algorithms have been quite successful in this area, so you might want to look into that. Here is a paper related to CUDA and AES encryption:http://www.manavski.com/downloads/PID505889.pdf

查看更多
女痞
5楼-- · 2020-05-11 11:59

We have finished first phase of research to increase performance of lossless data compression algorithms. Bzip2 was chosen for the prototype, our team optimized only one operation - Burrows–Wheeler transformation, and we got some results: 2x-4x speed up on good compressible files. The code works faster on all our tests.

We are going to complete bzip2, support deflate and LZMA for some real life tasks like: HTTP traffic and backups compression.

blog link: http://www.wave-access.com/public_en/blog/2011/april/22/breakthrough-in-cuda-data-compression.aspx

查看更多
趁早两清
6楼-- · 2020-05-11 12:06

We're making an attempt at porting bzip2 to CUDA. :) So far (and with only rough tests done), our Burrows-Wheeler Transform is 30% faster than the serial algorithm. http://bzip2.github.com

查看更多
够拽才男人
7楼-- · 2020-05-11 12:11

Typically compression algorithms cannot make use of parallel tasks, it is not easy to make the algorithms highly parallelizeable. In your examples, TAR is not a compression algorithm, and the only algorithm that might be highly parallelizeable is BZIP because it is a block compression algorithm. Each block can be compressed separately, but this would require lots and lots of memory. LZMA does not work in parallel either, when you see 7zip using multiple threads this is because 7zip splits the data stream into 2 different streams that each are compressed with LZMA in a separate thread, so the compression algorithm itself is not paralllel. This splitting only works when the data permits this.

查看更多
登录 后发表回答