Can/Should I run this code on a GPU?

Published 2019-03-07 23:36

I'm working on a statistical application that has an array containing approximately 10 - 30 million floating point values.

Several methods perform different, but independent, calculations on the array in nested loops, for example:

Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;

    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }

    noOfNumbers.Add(x, noOfOccurrences);
}

The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:

  • Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
  • If yes: does anyone know of any tutorials or have any sample code (programming language doesn't matter)?

Any help would be highly appreciated.

5 Answers
戒情不戒烟 · 2019-03-07 23:46

I don't know much of anything about parallel processing or GPGPU, but for this specific example you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations over the same data set, you may get better speed doing them all in the same pass, since you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.

In addition, you really don't want to add a small amount to a floating point number repeatedly like that; the rounding error will add up and you won't get what you intended. I've added an if statement to my sample below to check whether an input matches your pattern of iteration, but omit it if you don't actually need that.

I don't know any C#, but a single pass implementation of your sample would look something like this:

Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

foreach (float x in largeFloatingPointArray)
{
    if ((float)Math.Truncate(x / 0.0001f) * 0.0001f == x)
    {
        if (noOfNumbers.ContainsKey(x))
            noOfNumbers[x] = noOfNumbers[x] + 1;   // Dictionary.Add throws if the key already exists
        else
            noOfNumbers.Add(x, 1);
    }
}

Hope this helps.

萌系小妹纸 · 2019-03-07 23:47

Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?

  • Definitely YES: this kind of algorithm is a typical candidate for massively data-parallel processing, the thing GPUs are so good at.

If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?

  • When you want to go the GPGPU way you have two alternatives: CUDA or OpenCL.

    CUDA is mature, with a lot of tools, but is NVidia-GPU centric.

    OpenCL is a standard that runs on NVidia and AMD GPUs, and on CPUs too, so you should really favour it.

  • For tutorials, there is an excellent series on CodeProject by Rob Farber: http://www.codeproject.com/Articles/Rob-Farber#Articles

  • For your specific use case there are a lot of samples for histogram building with OpenCL (note that many are image histograms, but the principles are the same).

  • Since you use C#, you can use bindings like OpenCL.Net or Cloo.

  • If your array is too big to fit in GPU memory, you can block-partition it and rerun your OpenCL kernel on each part easily (see the host-side sketch below).
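
A minimal host-side sketch of that block-partitioning idea, in C++. count_occurrences_on_gpu is a hypothetical wrapper around whatever OpenCL kernel launch you end up writing; its CPU body below is only a placeholder so the sketch compiles:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the real OpenCL launch: it should copy `chunk`
// to the device, run the counting kernel, and accumulate the result into
// `histogram` (bin i holds the count of the value i * 0.0001f).
// The body below is only a CPU placeholder standing in for the kernel.
void count_occurrences_on_gpu(const float *chunk, std::size_t chunk_len,
                              std::vector<int> &histogram)
{
    for (std::size_t i = 0; i < chunk_len; ++i) {
        std::size_t bin = static_cast<std::size_t>(chunk[i] * 10000.0f + 0.5f);
        if (bin < histogram.size())
            ++histogram[bin];
    }
}

// Block-partition the big array and process one chunk at a time, so the
// working set always fits in GPU memory.
void process_in_chunks(const float *data, std::size_t total,
                       std::size_t max_chunk, std::vector<int> &histogram)
{
    for (std::size_t offset = 0; offset < total; offset += max_chunk) {
        std::size_t len = std::min(max_chunk, total - offset);
        count_occurrences_on_gpu(data + offset, len, histogram);
    }
}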

戒情不戒烟 · 2019-03-07 23:52

UPDATE GPU Version

__global__ void hash (float *largeFloatingPointArray,int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
    float y;                                         // compute one (or more) floats
    int noOfOccurrences = 0;
    int a;

    while( x < size )            // While there is work to do, each thread will:
    {
        dictionary[x] = 0;       // Initialize the dictionary slot it works on
        noOfOccurrences = 0;

        for(int j = 0; j < largeFloatingPointArraySize; j++) // Search the float array for
        {                                                     // values equal to this thread's x
           y = largeFloatingPointArray[j];  // Take a candidate from the floats array 
           y *= 10000;                      // e.g if y = 0.0001f;
           a = y + 0.5;                     // a = 1 + 0.5 = 1;
           if (a == x) noOfOccurrences++;    
        }                                      

        dictionary[x] += noOfOccurrences; // Update in the dictionary 
                                          // the number of times that the float appears 

    x += blockDim.x * gridDim.x;  // Move on to the next position this thread will work on
    }
}

I have only tested this one with smaller inputs, because I am testing on my laptop. Nevertheless, it did work. However, further testing is necessary.
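
For reference, here is a minimal host-side sketch of how the kernel above could be launched. The block/thread counts and the dictionary size of 1,000,001 bins (one per 0.0001f step from 0f to 100f) are my own assumptions, and error checking is omitted:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

void run_hash(const float *h_data, int n)
{
    const int DICT_SIZE = 1000001;        // one bin per 0.0001f step in [0f, 100f]
    const int threads = 256, blocks = 64;

    float *d_data;
    int *d_dict;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMalloc((void **)&d_dict, DICT_SIZE * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // The kernel zeroes each dictionary slot itself before counting
    hash<<<blocks, threads>>>(d_data, n, d_dict, DICT_SIZE, blocks);

    int *h_dict = (int *)malloc(DICT_SIZE * sizeof(int));
    cudaMemcpy(h_dict, d_dict, DICT_SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    // h_dict[i] now holds the number of occurrences of the value i * 0.0001f

    free(h_dict);
    cudaFree(d_data);
    cudaFree(d_dict);
}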

UPDATE Sequential Version

I just wrote this naive version that performs your algorithm on 30,000,000 elements in less than 20 seconds (including the function that generates the data).

Basically, it sorts your array of floats, then walks over the sorted array counting how many times each value appears consecutively and puts the value in a dictionary along with its count.

You could use a sorted map instead of the unordered_map that I used.

Here's the code:

#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>


typedef std::tr1::unordered_map<float, int> Mymap;


void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;

    for(long int i = 0; i < size; i++)
        data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}

void print_array(float *data, long int size)
{

    for(long int i = 2; i < size; i++)
        printf("%f\n",data[i]);

}

std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
    float previous = data[0];
    int count = 1;
    std::tr1::unordered_map<float, int> dict;

    for(long int i = 1; i < size; i++)
    {
        if(previous == data[i])
            count++;
        else
        {
          dict.insert(Mymap::value_type(previous,count));
          previous = data[i];
          count = 1;         
        }

    }
    dict.insert(Mymap::value_type(previous,count)); // add the last member
    return dict;

}

void printMAP(std::tr1::unordered_map<float, int> dict)
{
   for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
  {
     std::cout << "key(string): " << i->first << ", value(int): " << i->second << std::endl;
   }
}


int main(int argc, char** argv)
{
  int size = 1000000; 
  if(argc > 1) size = atoi(argv[1]);
  printf("Size = %d\n", size);

  float *data = new float[size];   // allocate on the heap; 30M floats would overflow the stack

  std::tr1::unordered_map<float, int> dict;

  generator(data, size);

  std::sort(data, data + size);
  dict = fill_dict(data, size);

  delete [] data;
  return 0;
}

If you have the Thrust library installed on your machine, you should use this:

#include <thrust/sort.h>
thrust::sort(data, data + size);

instead of this

std::sort(data, data + size);

It will surely be faster.
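
Going one step further, once the data is on the device, Thrust can also do the counting itself with a sort followed by reduce_by_key. This is my own rough sketch, untested, and sizing the output vectors for the worst case (every value unique) is an assumption:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/pair.h>
#include <cstdio>

// Sort the values on the GPU, then collapse each run of equal values into
// one key plus the length of the run (i.e. its number of occurrences).
void gpu_count(const float *data, int size)
{
    thrust::device_vector<float> d_values(data, data + size); // copy to the GPU
    thrust::sort(d_values.begin(), d_values.end());

    // Worst case: every value is unique
    thrust::device_vector<float> d_unique(size);
    thrust::device_vector<int>   d_counts(size);

    thrust::pair<thrust::device_vector<float>::iterator,
                 thrust::device_vector<int>::iterator> ends =
        thrust::reduce_by_key(d_values.begin(), d_values.end(),
                              thrust::make_constant_iterator(1),
                              d_unique.begin(), d_counts.begin());

    int n_unique = ends.first - d_unique.begin();
    // d_unique[0..n_unique) and d_counts[0..n_unique) now form the "dictionary"
    printf("%d unique values\n", n_unique);
}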

Original Post

"I'm working on a statistical application which has a large array containin 10 - 30 millions of floating point values".

"Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?"

Yes, it is. A month ago, I put a Molecular Dynamics simulation entirely on the GPU. One of the kernels, which calculates the force between pairs of particles, receives 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).

Since you are planning to put 30 million floats on the GPU, that is about 114 MB of global memory, which is not a problem; even my laptop has 250 MB.

Could the amount of computation be an issue in your case? Based on my experience with Molecular Dynamics (MD), I would say no. The sequential MD version takes about 25 hours to complete, while the GPU version took 45 minutes. You said your application takes a couple of hours, and based on your code example it looks lighter than the Molecular Dynamics.

Here's the force calculation example:

__global__ void add(double *fx, double *fy, double *fz,
                    double *x, double *y, double *z,...){

     int pos = (threadIdx.x + blockIdx.x * blockDim.x); 

     ...

     while(pos < particles)
     {

      for (i = 0; i < particles; i++)
      {
              if(//inside of the same radius)
                {
                 // calculate force
                } 
       }
     pos += blockDim.x * gridDim.x;  
     }        
  }

A simple example of CUDA code could be the sum of two arrays:

In C:

for(int i = 0; i < N; i++)
    c[i] = a[i] + b[i]; 

In CUDA:

__global__ void add(int *c, int *a, int *b, int N)
{
  int pos = threadIdx.x + blockIdx.x * blockDim.x;
  for(; pos < N; pos += blockDim.x * gridDim.x)
      c[pos] = a[pos] + b[pos];
}

In CUDA you basically take each for iteration and divide it among the threads,

1) threadIdx.x + blockIdx.x*blockDim.x;

Each block has an id from 0 to N-1 (N being the maximum number of blocks), and each block has X threads with ids from 0 to X-1.

1) gives you the iteration that each thread will compute, based on its id and the id of the block the thread is in; blockDim.x is the number of threads per block.

So if you have 2 blocks, each with 10 threads, and N = 40, then:

Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39

Looking at your code, I made this draft of what it could be in CUDA:

__global__ void hash(float *largeFloatingPointArray,
                     int largeFloatingPointArraySize, int *dictionary)
{
    // You can turn the dictionary into one array of int:
    // here each position will represent one float.
    // Since x = 0f; x < 100f; x += 0.0001f
    // you can associate each x with a different position
    // in the dictionary:

    // pos 0 has the same meaning as 0f;
    // pos 1 means the float 0.0001f
    // pos 2 means the float 0.0002f, etc.
    // Then you use the int at each position
    // to count how many times that "float" has appeared.

    int x = blockIdx.x;  // Each block will take a different x to work on
    float y;

    while (x < 1000000)  // x < 100f (for an incremental step of 0.0001f)
    {
        int noOfOccurrences = 0;
        float z = converting_int_to_float(x); // This (hypothetical) function converts x to the
                                              // float it stands for (x * 0.0001f)

        // Each thread of the block takes a strided share of the
        // values in largeFloatingPointArray
        for (int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
        {
            y = largeFloatingPointArray[j];
            if (z == y)
            {
                noOfOccurrences++;
            }
        }

        // Every thread adds its own partial count; atomicAdd is needed because
        // several threads update the same dictionary entry
        atomicAdd(&dictionary[x], noOfOccurrences);
        __syncthreads();

        x += gridDim.x;  // This block moves on to the next x
    }
}

You have to use atomicAdd because different threads may update the same dictionary entry at the same time, so you have to ensure mutual exclusion.

This is only one approach; you could even give the iterations of the outer loop to the threads instead of the blocks.

Tutorials

The Dr. Dobb's Journal series CUDA: Supercomputing for the masses by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.

and others:

Take a look at the last item; you will find many links for learning CUDA.

OpenCL: OpenCL Tutorials | MacResearch

兄弟一词,经得起流年. · 2019-03-08 00:01

In addition to the suggestions by the above posters, use the TPL (Task Parallel Library) where appropriate to run in parallel on multiple cores.

The example above could use Parallel.ForEach and ConcurrentDictionary, but a more complex map-reduce setup, where the array is split into chunks that each generate a dictionary which are then reduced into a single dictionary, would give you better results.

I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.

看我几分像从前 · 2019-03-08 00:06

I am not sure whether using GPUs would be a good match, given that 'largeFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self-contained calculations.

I think turning this single-process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending on how many systems are available.

You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.

Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use the floating point value as the key and the number of occurrences in the array as the value. The worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run, then an in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this the 'lookup' system.

On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, then 'gather' the results as they become available. E.g. a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back an array of occurrences and is ready to work on another chunk.

The performance is improved because we do not keep iterating over largeFloatingPointArray. If the lookup system becomes a bottleneck, then it can be replicated on as many systems as needed.

With a large enough number of systems working in parallel, it should be possible to reduce the processing time to minutes.

I am working on a compiler for parallel programming in C targeted at many-core based systems, often referred to as microservers, that are (or will be) built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.

I have a version of the compiler working, and it could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication (IPC) across systems. One of the IPC mechanisms available is sockets/TCP/IP.

If you need help in implementing a distributed solution, I'd be happy to discuss it with you.

Added Nov 16, 2012.

I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.

#include <stdio.h>
#include <stdlib.h>

/*
 * Convert the X range from 0f to 100f in steps of 0.0001f
 * into a range of integers 0 to 1 + (100 * 10000) to use as an
 * index into an array.
 */

#define X_MAX           (1 + (100 * 10000))

/*
 * Number of floats in largeFloatingPointArray needs to be defined
 * below to be whatever your value is.
 */

#define LARGE_ARRAY_MAX (1000)

int main(void)
{
    int j, y, *noOfOccurances;
    float *largeFloatingPointArray;

    /*
     * Allocate memory for largeFloatingPointArray and populate it.
     */

    largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));    
    if (largeFloatingPointArray == 0) {
        printf("out of memory\n");
        exit(1);
    }

    /*
     * Allocate memory to hold noOfOccurances. The index/10000 is the
     * floating point number.  The content is the count.
     *
     * E.g. noOfOccurances[12345] = 20, means 1.2345f occurs 20 times
     * in largeFloatingPointArray.
     */

    noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
    if (noOfOccurances == 0) {  
        printf("out of memory\n");
        exit(1);
    }

    for (j = 0; j < LARGE_ARRAY_MAX; j++) {
        y = (int)(largeFloatingPointArray[j] * 10000);
        if (y >= 0 && y < X_MAX) {
            noOfOccurances[y]++;
        }   
    }

    return 0;
}