Allocating unified memory in my program: out of memory error after running

Posted 2020-03-26 05:33

Question:

Before asking this, I read this question, which is similar to mine.

Here I will provide my program in detail.

#include <cstdio>
#include <cstdlib>
#include <iostream>

#define N 70000
#define M 1000

class ObjBox
{
public:
    int oid;
    float x;
    float y;
    float ts;
};

class Bucket
{
public:
    int bid;
    int nxt;
    ObjBox *arr_obj;  // nested managed array, allocated per bucket
    int nO;
};

int main()
{
    Bucket *arr_bkt;
    cudaMallocManaged(&arr_bkt, N * sizeof(Bucket));

    for (int i = 0; i < N; i++)
    {
        arr_bkt[i].bid = i;
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;

        // one managed allocation per bucket for its ObjBox array
        cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), M * sizeof(ObjBox));
        if (r != cudaSuccess)
        {
            printf("CUDA Error on %s\n", cudaGetErrorString(r));
            exit(0);
        }

        for (int j = 0; j < M; j++)
        {
            arr_bkt[i].arr_obj[j].oid = -1;
            arr_bkt[i].arr_obj[j].x = -1;
            arr_bkt[i].arr_obj[j].y = -1;
            arr_bkt[i].arr_obj[j].ts = -1;
        }
    }

    std::cout << "Bucket Array Initialization Completed..." << std::endl;
    cudaFree(arr_bkt);

    return 0;
}

In my main program, I allocate an array of Bucket objects, each of which contains a nested array of ObjBox. There are N (70000) Buckets in total, with M (1000) ObjBox elements in each Bucket. The program compiles normally but fails at runtime with an out-of-memory error on the line cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), M * sizeof(ObjBox));

I have been trying to solve this for a long time; here are some things I have found:

1. When N is smaller, such as 30000, 40000, or even 60000, the program works normally. That is, it is possible to allocate this much unified memory in such a structure;

2. There is still plenty of free memory. My server has 16 GB of host memory and 11 GB of GPU global memory, but in this program the Bucket array consumes only about

    N * M * sizeof(ObjBox) = 70000 * 1000 * 16 bytes = 1120 MB

(see the sketch after this list for a quick check of this arithmetic);

3. The value of M has almost nothing to do with the out-of-memory error: when N stays at 70000 and M is decreased to 100, the program still breaks;
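For reference, a quick check of the arithmetic in point 2 above. This is a minimal standalone sketch of mine (not part of the original program), assuming 4-byte int and float with no struct padding, which is what common ABIs give for this class:

    #include <cstdio>

    class ObjBox
    {
    public:
        int oid;
        float x;
        float y;
        float ts;
    };

    int main()
    {
        // 4 fields x 4 bytes each; expect 16 bytes with no padding
        printf("sizeof(ObjBox) = %zu bytes\n", sizeof(ObjBox));
        // total payload of all nested arrays: N * M * sizeof(ObjBox)
        printf("total = %zu bytes\n", (size_t)70000 * 1000 * sizeof(ObjBox));
        return 0;
    }

With these assumptions it prints 1120000000 bytes, i.e. about 1120 MB, far below either memory capacity.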

My GPU is a Tesla K40c. I presented my question to my tutor, and she passed it along to a friend of hers, who ran the program on her Tesla K20 with CUDA version 7.0; it worked and allocated the structure normally.

What is going on? How can I allocate this structure on my Tesla K40c? My tutor thinks there might be some limit in the GPU driver settings, but I haven't been able to find it.

Answer 1:

If I modify your code with some instrumentation, like this:

#include <cstdio>
#include <cstdlib>
#include <iostream>

#define N 70000
#define M 1000

class ObjBox
{
    public:

        int oid; 
        float x; 
        float y; 
        float ts;
};

class Bucket
{
    public:

        int bid; 
        int nxt; 
        ObjBox *arr_obj; 
        int nO;
};

int main()
{

    Bucket *arr_bkt;
    cudaMallocManaged(&arr_bkt, N * sizeof(Bucket));

    for (int i = 0; i < N; i++) {
        arr_bkt[i].bid = i; 
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;

        size_t allocsz = size_t(M) * sizeof(ObjBox);
        cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), allocsz);
        if (r != cudaSuccess) {
            printf("CUDA Error on %s\n", cudaGetErrorString(r));
            exit(0);
        } else {
            size_t total_mem, free_mem;
            cudaMemGetInfo(&free_mem, &total_mem);
            std::cout << i << ":Allocated " << allocsz;
            std::cout << " Currently " << free_mem << " bytes free" << std::endl;
        } 

        for (int j = 0; j < M; j++) {
            arr_bkt[i].arr_obj[j].oid = -1;
            arr_bkt[i].arr_obj[j].x = -1;
            arr_bkt[i].arr_obj[j].y = -1;
            arr_bkt[i].arr_obj[j].ts = -1;
        }
    }

    std::cout << "Bucket Array Initial Completed..." << std::endl;
    cudaFree(arr_bkt);

    return 0;
}

and compile and run it on a unified-memory system with 16GB of physical host memory and 2GB of physical device memory under the Linux 352.39 driver, I get this:

0:Allocated 16000 Currently 2099871744 bytes free
1:Allocated 16000 Currently 2099871744 bytes free
2:Allocated 16000 Currently 2099871744 bytes free
3:Allocated 16000 Currently 2099871744 bytes free
4:Allocated 16000 Currently 2099871744 bytes free
5:Allocated 16000 Currently 2099871744 bytes free
6:Allocated 16000 Currently 2099871744 bytes free
7:Allocated 16000 Currently 2099871744 bytes free
8:Allocated 16000 Currently 2099871744 bytes free
9:Allocated 16000 Currently 2099871744 bytes free
....
....
....
65445:Allocated 16000 Currently 1028161536 bytes free
65446:Allocated 16000 Currently 1028161536 bytes free
65447:Allocated 16000 Currently 1028161536 bytes free
65448:Allocated 16000 Currently 1028161536 bytes free
65449:Allocated 16000 Currently 1028161536 bytes free
65450:Allocated 16000 Currently 1028161536 bytes free
65451:Allocated 16000 Currently 1028161536 bytes free
CUDA Error on out of memory    

i.e. it reports out of memory with plenty of free memory remaining on the device.

I think the key to understanding this is the number of allocations at the failure point, rather than their size. 65451 is suspiciously close to 65535 (i.e. 2^16 - 1). Allowing for the internal allocations that the runtime makes itself, I am going to guess that there is some sort of accidental or deliberate limit of 65535 on the total number of managed memory allocations.
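If a cap on the allocation count is indeed the culprit, one way to stay well under it (a sketch of my own, not part of the code above) is to make a single large managed allocation and carve the per-bucket ObjBox arrays out of it, turning N + 1 managed allocations into 2:

    // Sketch: one pooled managed allocation instead of one per bucket.
    // Assumes the Bucket/ObjBox definitions and the N/M macros above.
    Bucket *arr_bkt;
    ObjBox *pool;
    cudaMallocManaged(&arr_bkt, N * sizeof(Bucket));
    cudaMallocManaged(&pool, size_t(N) * M * sizeof(ObjBox));

    for (int i = 0; i < N; i++) {
        arr_bkt[i].bid = i;
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;
        // each bucket's array is a slice of the single pool
        arr_bkt[i].arr_obj = pool + size_t(i) * M;
    }

This also makes cleanup trivial: freeing pool and arr_bkt releases everything.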

I would be very interested to see whether you can reproduce this. If you can, I would be contemplating filing a bug report with NVIDIA.
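As a starting point for reproducing it, here is a minimal standalone sketch (again mine, not from the program above) that simply counts how many small managed allocations succeed before cudaMallocManaged fails. If the hypothesis above is right, the count should land just under 2^16 regardless of the allocation size:

    #include <cstdio>

    int main()
    {
        // Allocate tiny managed blocks until the runtime refuses,
        // then report how many calls succeeded. The blocks are
        // deliberately leaked; the process exits right after.
        size_t count = 0;
        for (;;) {
            void *p = nullptr;
            cudaError_t r = cudaMallocManaged(&p, 256);
            if (r != cudaSuccess) {
                printf("Failed after %zu allocations: %s\n",
                       count, cudaGetErrorString(r));
                break;
            }
            ++count;
        }
        return 0;
    }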