Why is my rather trivial CUDA program erring with

I made a simple CUDA program for practice. It simply copies over data from one array to another:

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule

# Global constants
N = 2**20 # size of array a
a = np.linspace(0, 1, N)
e = np.empty_like(a)
block_size_x = 512

# Instantiate block and grid sizes.
block_size = (block_size_x, 1, 1)
grid_size = (N / block_size_x, 1)

# Create the CUDA kernel, and run it.
mod = SourceModule("""
  __global__ void D2x_kernel(double* a, double* e, int N) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid > 0 && tid < N - 1) {
      e[tid] = a[tid];
    }
  }
""")
func = mod.get_function('D2x_kernel')
func(a, cuda.InOut(e), np.int32(N), block=block_size, grid=grid_size)
print str(e)

However, I get this error: pycuda._driver.LogicError: cuLaunchKernel failed: invalid value

When I get rid of the second argument double* e in my kernel function and invoke the kernel without the argument e, the error goes away. Why is that? What does this error mean?

Your a array does not exist in device memory, so I suspect that PyCUDA is ignoring (or otherwise mishandling) the first argument to your kernel invocation and passing in only e and N. The launch would then fail because the kernel expects three arguments but receives only two. Removing double* e from your kernel definition might make the error message go away, but your kernel still won't work properly.

A quick fix is to wrap a in a cuda.In() call, which instructs PyCUDA to copy a to the device before launching the kernel. That is, your kernel launch line should be:

func(cuda.In(a), cuda.InOut(e), np.int32(N), block=block_size, grid=grid_size)
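
For context, here is a minimal sketch of how that launch line might sit in a complete script. The helper names launch_config and copy_on_gpu are hypothetical (they are not part of PyCUDA), and the ceiling-division grid size is my own addition so that the grid dimensions stay integers on both Python 2 and Python 3. The kernel itself is unchanged from the question.

```python
import numpy as np


def launch_config(n, block_size_x=512):
    """Return (block, grid) tuples covering n elements with integer sizes."""
    # Ceiling division keeps the grid size an int on both Python 2 and 3.
    grid_size_x = (n + block_size_x - 1) // block_size_x
    return (block_size_x, 1, 1), (grid_size_x, 1)


def copy_on_gpu(a):
    """Run the question's kernel with `a` wrapped in cuda.In()."""
    # Imported here so launch_config() stays usable on a machine without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
      __global__ void D2x_kernel(double* a, double* e, int N) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        if (tid > 0 && tid < N - 1) {   // guard kept as in the question
          e[tid] = a[tid];
        }
      }
    """)
    func = mod.get_function("D2x_kernel")

    a = np.ascontiguousarray(a, dtype=np.float64)  # kernel expects double*
    e = np.empty_like(a)
    block, grid = launch_config(a.size)
    # cuda.In copies a to the device; cuda.InOut copies e there and back.
    func(cuda.In(a), cuda.InOut(e), np.int32(a.size), block=block, grid=grid)
    return e
```

On a machine with a CUDA device, calling copy_on_gpu(np.linspace(0, 1, 2**20)) should return the copied array without the launch error.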

Edit: Also, note that your kernel does not copy the first and last elements of a to e, because the if (tid > 0 && tid < N - 1) condition skips them. To copy the entire array, the condition should be if (tid < N).
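
To see the effect of the two conditions without a GPU, here is a rough CPU stand-in written with NumPy. The function emulate_copy is hypothetical and only mimics the per-thread guard; it is not how PyCUDA executes the kernel. Elements that no "thread" writes are left as NaN so they are easy to spot.

```python
import numpy as np


def emulate_copy(a, guard):
    """CPU stand-in for the kernel: copy a[tid] to e[tid] where guard(tid, N) holds."""
    n = a.size
    e = np.full_like(a, np.nan)  # NaN marks elements no thread wrote
    for tid in range(n):
        if guard(tid, n):
            e[tid] = a[tid]
    return e


a = np.linspace(0, 1, 8)
# Guard from the question: skips tid == 0 and tid == N - 1.
partial = emulate_copy(a, lambda tid, n: tid > 0 and tid < n - 1)
# Suggested guard: every valid index is copied.
full = emulate_copy(a, lambda tid, n: tid < n)
```

With the original guard, the first and last entries of partial stay NaN; with tid < N, full matches a everywhere.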