I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.
Several methods perform different, but independent, calculations on the array in nested loops, for example:
    Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
    for (float x = 0f; x < 100f; x += 0.0001f) {
        int noOfOccurrences = 0;
        foreach (float y in largeFloatingPointArray) {
            if (x == y) {
                noOfOccurrences++;
            }
        }
        noOfNumbers.Add(x, noOfOccurrences);
    }
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
- Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
- If yes: Does anyone know of a tutorial or have any sample code? (The programming language doesn't matter.)
Any help would be highly appreciated.
I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to repeatedly add a small amount to a floating point number like that; the rounding error will add up and you won't get what you intended. I've added an if statement to the sample below to check that inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single pass implementation of your sample would look something like this:
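(A sketch in C++, since I don't know C#; the unordered_map plays the role of your Dictionary, and the if statement mirrors the 0 to 100 range your loop iterates over.)

    #include <unordered_map>
    #include <vector>

    // One pass over the data: count every value as it is seen instead of
    // re-scanning the whole array for each candidate x.
    std::unordered_map<float, int> countOccurrences(const std::vector<float> &largeFloatingPointArray)
    {
        std::unordered_map<float, int> noOfNumbers;
        for (float y : largeFloatingPointArray) {
            if (y >= 0.0f && y < 100.0f)   // matches your iteration range; omit if not needed
                ++noOfNumbers[y];          // inserts 0 on first sight, then increments
        }
        return noOfNumbers;
    }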
Hope this helps.
When you want to go the GPGPU way you have two alternatives: CUDA or OpenCL.
CUDA is mature with a lot of tools, but it is specific to NVidia GPUs.
OpenCL is a standard that runs on NVidia and AMD GPUs, and on CPUs too. So you should really favour it.
For tutorial you have an excellent series on CodeProject by Rob Farber : http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use-case there are a lot of samples for building histograms with OpenCL (note that many are image histograms, but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to fit in GPU memory, you can easily block-partition it and rerun your OpenCL kernel for each part.
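A rough sketch of that chunking loop in OpenCL host code (assumptions: a context, command queue, built kernel, and a device buffer big enough for one chunk already exist, the kernel takes the buffer as argument 0 and the element count as argument 1, and it ignores work-items whose global id is beyond the count):

    #include <CL/cl.h>

    #define CHUNK_SIZE (4 * 1024 * 1024)   /* floats per chunk, illustrative */

    /* Process the array chunk by chunk, re-running the same kernel on each part. */
    void processInChunks(cl_command_queue queue, cl_kernel kernel,
                         cl_mem deviceChunk, const float *hostArray, size_t total)
    {
        for (size_t offset = 0; offset < total; offset += CHUNK_SIZE) {
            size_t count = (total - offset < CHUNK_SIZE) ? (total - offset) : CHUNK_SIZE;

            /* copy the next chunk to the GPU (blocking write) */
            clEnqueueWriteBuffer(queue, deviceChunk, CL_TRUE, 0,
                                 count * sizeof(float), hostArray + offset,
                                 0, NULL, NULL);

            /* argument 1 is assumed to be the number of valid elements in this chunk */
            cl_uint n = (cl_uint)count;
            clSetKernelArg(kernel, 1, sizeof(cl_uint), &n);

            /* one work-item per element, rounded up to the work-group size */
            size_t local = 256;
            size_t global = ((count + local - 1) / local) * local;
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                   0, NULL, NULL);
        }
        clFinish(queue);   /* wait for the last chunk to finish */
    }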
UPDATE GPU Version
I have only tested this one with smaller inputs, because I am testing on my laptop. Nevertheless, it did work; further testing is still needed.
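A sketch of that GPU version (illustrative only: each candidate value x = b * 0.0001f gets its own thread, so every thread owns one counter and no atomics are needed; the test sizes are deliberately tiny because the brute force does nBuckets * nData comparisons):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // One thread per candidate value; the thread scans the whole array and
    // counts its own matches.
    __global__ void countOccurrences(const float *data, int *counts,
                                     int nData, int nBuckets)
    {
        int b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b >= nBuckets) return;

        float x = b * 0.0001f;          // the value this thread is counting
        int count = 0;
        for (int i = 0; i < nData; ++i)
            if (data[i] == x)
                ++count;
        counts[b] = count;
    }

    int main()
    {
        const int nBuckets = 1000000;   // x = 0.0000 ... 99.9999
        const int nData    = 10000;     // small test input

        // generate test data that lies on the same 0.0001 grid as the question
        float *h_data = new float[nData];
        for (int i = 0; i < nData; ++i)
            h_data[i] = (rand() % nBuckets) * 0.0001f;

        float *d_data; int *d_counts;
        cudaMalloc(&d_data, nData * sizeof(float));
        cudaMalloc(&d_counts, nBuckets * sizeof(int));
        cudaMemcpy(d_data, h_data, nData * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (nBuckets + threads - 1) / threads;
        countOccurrences<<<blocks, threads>>>(d_data, d_counts, nData, nBuckets);

        int *h_counts = new int[nBuckets];
        cudaMemcpy(h_counts, d_counts, nBuckets * sizeof(int), cudaMemcpyDeviceToHost);

        printf("value 0.0001 occurs %d times\n", h_counts[1]);   // spot check

        cudaFree(d_data); cudaFree(d_counts);
        delete[] h_data; delete[] h_counts;
        return 0;
    }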
UPDATE Sequential Version
I just wrote this naive version that runs your algorithm for 30,000,000 values in less than 20 seconds (including the function that generates the data).
Basically, it sorts your array of floats, then walks over the sorted array counting how many times each value appears consecutively, and puts each value into a dictionary along with its count.
You can use a sorted map instead of the unordered_map that I used.
Here's the code:
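(A sketch; generateData is just an illustrative stand-in for however you load or produce the 30 million values.)

    #include <algorithm>
    #include <cstdlib>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    // Fill the array with values on the same 0.0001 grid as the question.
    std::vector<float> generateData(std::size_t n)
    {
        std::vector<float> data(n);
        for (float &v : data)
            v = (std::rand() % 1000000) * 0.0001f;   // 0.0000 ... 99.9999
        return data;
    }

    int main()
    {
        std::vector<float> data = generateData(30000000);

        // 1) sort, so that equal values end up next to each other
        std::sort(data.begin(), data.end());

        // 2) one pass over the sorted array: each run of equal values
        //    becomes one entry (value, run length) in the dictionary
        std::unordered_map<float, int> counts;
        std::size_t i = 0;
        while (i < data.size()) {
            std::size_t j = i;
            while (j < data.size() && data[j] == data[i])
                ++j;
            counts[data[i]] = static_cast<int>(j - i);
            i = j;
        }

        std::cout << "distinct values: " << counts.size() << '\n';
        return 0;
    }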
If you have the Thrust library installed on your machine, you should replace the std::sort call above with thrust::sort, as sketched below. It will certainly be faster.
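For example (a sketch: it copies the data to the GPU, sorts it there, and copies it back):

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <vector>

    // Drop-in replacement for the std::sort call above.
    void sortOnGpu(std::vector<float> &data)
    {
        thrust::device_vector<float> d(data.begin(), data.end());  // host -> device
        thrust::sort(d.begin(), d.end());                          // parallel sort on the GPU
        thrust::copy(d.begin(), d.end(), data.begin());            // device -> host
    }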
Original Post
"I'm working on a statistical application which has a large array containin 10 - 30 millions of floating point values".
"Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?"
Yes, it is. A month ago, I put a molecular dynamics simulation entirely on the GPU. One of the kernels, which calculates the force between pairs of particles, receives six arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).
So you are planning to put 30 million floating-point values on the GPU; that is about 114 MB of global memory, so it is not a problem. Even my laptop has 250 MB.
Could the number of calculations be an issue in your case? Based on my experience with molecular dynamics (MD), I'd say no. The sequential MD version takes about 25 hours to complete, while the GPU version took 45 minutes. You said your application takes a couple of hours, and based on your code example it looks lighter than the MD simulation.
Here's the force calculation example:
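In outline, it looks like this (a generic sketch with placeholder arrays and a simplified gravity-like force law, not the original MD kernel):

    // One thread per particle i; it accumulates the force exerted on i
    // by every other particle j. Six double arrays, as described above.
    __global__ void computeForces(const double *x, const double *y, const double *z,
                                  double *fx, double *fy, double *fz, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        double fxi = 0.0, fyi = 0.0, fzi = 0.0;
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double dx = x[j] - x[i];
            double dy = y[j] - y[i];
            double dz = z[j] - z[i];
            double r2 = dx * dx + dy * dy + dz * dz + 1e-12;  // softening avoids division by zero
            double r  = sqrt(r2);
            double s  = 1.0 / (r2 * r);                       // 1/r^2 magnitude along the unit vector d/r
            fxi += s * dx;
            fyi += s * dy;
            fzi += s * dz;
        }
        fx[i] = fxi;  fy[i] = fyi;  fz[i] = fzi;
    }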
A simple example of CUDA code is the sum of two 2D arrays.
In C:
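(A sketch; ROWS and COLS are illustrative sizes.)

    #define ROWS 1000
    #define COLS 1000

    /* Plain C: element-wise sum of two 2D arrays, two nested loops. */
    void add2D(const float a[ROWS][COLS], const float b[ROWS][COLS], float c[ROWS][COLS])
    {
        for (int i = 0; i < ROWS; ++i)
            for (int j = 0; j < COLS; ++j)
                c[i][j] = a[i][j] + b[i][j];
    }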
In CUDA:
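(A sketch; the arrays are treated as one contiguous block of N = ROWS * COLS floats, and the launch configuration is illustrative.)

    /* CUDA: the loops disappear; every thread adds exactly one element. */
    __global__ void add2D(const float *a, const float *b, float *c, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* (1) the element this thread owns */
        if (i < N)
            c[i] = a[i] + b[i];
    }

    /* Example launch, one thread per element:
       add2D<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);              */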
In CUDA, you basically take each iteration of the for loop and hand it to a different thread. Each block has an id from 0 to N-1 (where N is the number of blocks launched), and each block has X threads with ids from 0 to X-1.
The line marked (1) gives the loop iteration that each thread will compute, based on its own id and the id of the block the thread belongs to; blockDim.x is the number of threads a block has.
So if you have 2 blocks, each with 10 threads, and an array of N = 40 elements: thread 0 of block 0 computes i = 0, thread 1 of block 0 computes i = 1, ..., and thread 9 of block 1 computes i = 19. Those 20 threads cover only half of the 40 elements, so you either launch more blocks or let each thread loop with a stride of blockDim.x * gridDim.x.
Looking at your code, I made this draft of what it could look like in CUDA:
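Something along these lines (a rough sketch: the outer loop over x goes to the blocks, the inner loop over the array to the threads of each block; noOfOccurrences must be zeroed beforehand, and one block per candidate value assumes the device allows a grid of about a million blocks):

    __global__ void countOccurrences(const float *largeFloatingPointArray,
                                     int *noOfOccurrences, int n)
    {
        float x = blockIdx.x * 0.0001f;                      // the outer-loop value this block handles

        for (int i = threadIdx.x; i < n; i += blockDim.x) {  // the block's threads stride over the array
            if (largeFloatingPointArray[i] == x)
                atomicAdd(&noOfOccurrences[blockIdx.x], 1);  // several threads update the same counter
        }
    }

    // Example launch, one block per value of x (0.0000 ... 99.9999):
    //   countOccurrences<<<1000000, 256>>>(d_array, d_counts, n);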
You have to use atomicAdd because different threads may read and write the same noOfOccurrences counter at the same time, so you have to ensure mutual exclusion.
This is only one approach; you could even give the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr. Dobb's Journal series CUDA: Supercomputing for the masses by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
and some others:
Take a look at the last item; you will find many links for learning CUDA.
OpenCL: OpenCL Tutorials | MacResearch
In addition to the suggestion by the above poster, use the TPL (Task Parallel Library) where appropriate to run in parallel on multiple cores.
The example above could use Parallel.ForEach and a ConcurrentDictionary, but a more complex map-reduce setup, where the array is split into chunks that each generate their own dictionary and are then reduced into a single dictionary, would give you better results; a sketch follows.
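A sketch of that setup (illustrative: the Partitioner splits the array into index ranges, each worker fills its own plain Dictionary, and the partial dictionaries are merged into a ConcurrentDictionary at the end):

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    static class OccurrenceCounter
    {
        public static ConcurrentDictionary<float, int> Count(float[] largeFloatingPointArray)
        {
            var result = new ConcurrentDictionary<float, int>();

            Parallel.ForEach(
                Partitioner.Create(0, largeFloatingPointArray.Length),  // split into chunks
                () => new Dictionary<float, int>(),                     // map: one local dictionary per worker
                (range, state, local) =>
                {
                    for (int i = range.Item1; i < range.Item2; i++)
                    {
                        float y = largeFloatingPointArray[i];
                        local.TryGetValue(y, out int count);
                        local[y] = count + 1;
                    }
                    return local;
                },
                local =>                                                // reduce: merge the partial results
                {
                    foreach (var pair in local)
                        result.AddOrUpdate(pair.Key, pair.Value, (key, existing) => existing + pair.Value);
                });

            return result;
        }
    }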
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
I am not sure whether using GPUs would be a good match, given that the 'largeFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self-contained calculations.
I think turning this single-process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending on how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use the floating-point value as the key and the number of occurrences in the array as the value. The worst-case scenario is that each value occurs only once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run, then an in-memory hash table makes sense. If it is static, the table could be saved in a key-value database such as Berkeley DB. Let's call this the 'lookup' system.
On another system, let's call it 'main', create chunks of work, 'scatter' the work items across N systems, and 'gather' the results as they become available. E.g. a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back an array of occurrences and is ready to work on another chunk.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If the lookup system becomes a bottleneck, it can be replicated on as many systems as needed.
With a large enough number of systems working in parallel, it should be possible to reduce the processing time to minutes.
I am working on a compiler for parallel programming in C targeted at many-core based systems, often referred to as microservers, that are or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working, which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication (IPC) across systems. One of the IPC mechanisms available is sockets over TCP/IP.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
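One way to do that in plain C (a sketch with illustrative names; like the loop in the question, it assumes only the multiples of 0.0001 in [0, 100) matter, so every value maps to one of 1,000,000 bucket indices and a single pass over the array suffices):

    #include <stddef.h>
    #include <string.h>

    #define BUCKETS 1000000                 /* multiples of 0.0001 in [0, 100) */

    /* One pass: snap each value to the nearest multiple of 0.0001, use that as
     * a bucket index, and count per bucket; values outside [0, 100) are ignored. */
    void countOccurrences(const float *largeFloatingPointArray, size_t n,
                          int counts[BUCKETS])
    {
        memset(counts, 0, BUCKETS * sizeof(int));
        for (size_t i = 0; i < n; ++i) {
            long bucket = (long)(largeFloatingPointArray[i] * 10000.0f + 0.5f);
            if (bucket >= 0 && bucket < BUCKETS)
                counts[bucket]++;
        }
    }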