Illegal Memory Access on cudaDeviceSynchronize

I am encountering a very strange bug in that I get an 'illegal memory access' error when running a Heat 2D simulation of a particular size, but the simulation runs well if I run the exact same simulation, just with fewer elements.

Is there a reason that increasing the size of an array would cause this exception? I am using a Titan Black GPU (6 GB of memory), but the simulation I am running is nowhere near that size. I calculated that I could run a 4000 x 4000 simulation, but I get errors if I exceed 250 x 250.

The error occurs immediately after I instantiate the array of simulation objects on the device. Instantiation code is as follows:

template<typename PlaceType, typename StateType>
__global__ void instantiatePlacesKernel(Place** places, StateType *state,
        void *arg, int *dims, int nDims, int qty) {
    unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx < qty) {
        // set pointer to corresponding state object
        places[idx] = new PlaceType(&(state[idx]), arg);
        places[idx]->setIndex(idx);
        places[idx]->setSize(dims, nDims);
    }
}

template<typename PlaceType, typename StateType>
Place** DeviceConfig::instantiatePlaces(int handle, void *argument, int argSize,
        int dimensions, int size[], int qty) {

    // add global constants to the GPU
    memcpy(glob.globalDims,size, sizeof(int) * dimensions);
    updateConstants(glob);

    // create places tracking
    PlaceArray p; // a struct to track qty, 
    p.qty = qty;

    // create state array on device
    StateType* d_state = NULL;
    int Sbytes = sizeof(StateType);
    CATCH(cudaMalloc((void** ) &d_state, qty * Sbytes));
    p.devState = d_state; // save device pointer

    // allocate device pointers
    Place** tmpPlaces = NULL;
    int ptrbytes = sizeof(Place*);
    CATCH(cudaMalloc((void** ) &tmpPlaces, qty * ptrbytes));
    p.devPtr = tmpPlaces; // save device pointer

    // handle arg if necessary
    void *d_arg = NULL;
    if (NULL != argument) {
        CATCH(cudaMalloc((void** ) &d_arg, argSize));
        CATCH(cudaMemcpy(d_arg, argument, argSize, H2D));
    }

    // load places dimensions
    int *d_dims;
    int dimBytes = sizeof(int) * dimensions;
    CATCH(cudaMalloc((void** ) &d_dims, dimBytes));
    CATCH(cudaMemcpy(d_dims, size, dimBytes, H2D));

    // launch instantiation kernel
    int blockDim = (qty - 1) / BLOCK_SIZE + 1;
    int threadDim = (qty - 1) / blockDim + 1;
    Logger::debug("Launching instantiation kernel");
    instantiatePlacesKernel<PlaceType, StateType> <<<blockDim, threadDim>>>(tmpPlaces, d_state,
            d_arg, d_dims, dimensions, qty);
    CHECK();

    CATCH(cudaDeviceSynchronize()); // ERROR OCCURS HERE

    // clean up memory
    if (NULL != argument) {
        CATCH(cudaFree(d_arg));
    }
    CATCH(cudaFree(d_dims));
    CATCH(cudaMemGetInfo(&freeMem, &allMem));

    return p.devPtr;
}

Please assume any custom types you see are working, as this code executes without error on a sufficiently small simulation. I am frustrated that it appears that the number of elements in the kernel function's places and state arrays causes an error when the size exceeds 250 x 250 elements. Any insight would be awesome.

Thank you!

I think it's likely that in-kernel new is failing, because you are allocating too much memory.

In-kernel new has similar behavior and limitations as in-kernel malloc. These allocations are limited to the device heap, which starts out by default at 8MB. If the 250x250 array size corresponds to something in that range (8MB), then going significantly above that would cause some of the new operations to "silently" fail (i.e. return null pointers). If you then try to use those null pointers, you'll get an illegal memory access.

A few recommendations:

Figure out how much space you need, and pre-reserve it ahead of time on the device heap using cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
When you're having trouble with kernels that use new or malloc, it may be useful for debug purposes to perhaps use a debug macro to check the returned pointers for NULL. This is a good practice in general.
You can learn how to debug an illegal memory access with more clarity (localizing it to a specific line in a specific kernel) using the method described here.