Negative Speed Gain Using Numba Vectorize target='cuda'

I am testing how much Numba's @vectorize decorator can speed up a code snippet similar to my actual code. I'm using the snippet from CUDAcast #10, available here and shown below:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize


@vectorize(["float32(float32, float32)"], target='cpu')
def VectorAdd(a,b):
        return a + b

def main():
        N = 32000000

        A = np.ones(N, dtype=np.float32)
        B = np.ones(N, dtype=np.float32)
        C = np.zeros(N, dtype=np.float32)


        start = timer()
        C = VectorAdd(A, B)
        vectoradd_time = timer() - start

        print("C[:5] = " + str(C[:5]))
        print("C[-5:] = " + str(C[-5:]))

        print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
        main()

In the CUDAcast demo, the presenter gets a 100x speedup by sending the large array operation to the GPU via the @vectorize decorator. However, when I set the @vectorize target to the GPU:

@vectorize(["float32(float32, float32)"], target='cuda')

... the result is 3-4 times slower. With target='cpu' my runtime is 0.048 seconds; with target='cuda' my runtime is 0.22 seconds. I'm using a Dell Precision laptop with an Intel Core i7-4710MQ processor and an NVIDIA Quadro K2100M GPU. The output of nvprof (the NVIDIA profiler tool) indicates that the majority of the time is spent in memory handling (as expected), but even the function evaluation alone takes longer on the GPU than the whole process did on the CPU. Obviously this isn't the result I was hoping for, but is it due to some error on my part, or is it reasonable given my hardware and code?

This question is interesting to me as well. I tried your code and got similar results. To investigate the issue, I wrote a CUDA kernel using cuda.jit and added it to your code:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

N = 16*50000 #32000000
blockdim = 16, 1
griddim = int(N/blockdim[0]), 1

@cuda.jit("void(float32[:], float32[:])")
def VectorAdd_GPU(a, b):
    i = cuda.grid(1)
    if i < N:
        a[i] += b[i]

@vectorize("float32(float32, float32)", target='cpu')
def VectorAdd(a,b):
    return a + b


A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

start = timer()
C = VectorAdd(A, B)
vectoradd_time = timer() - start
print("VectorAdd took %f seconds" % vectoradd_time)

start = timer()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
VectorAdd_GPU[griddim,blockdim](d_A, d_B)
C = d_A.copy_to_host()
vectoradd_time = timer() - start
print("VectorAdd_GPU took %f seconds" % vectoradd_time)

print("C[:5] = " + str(C[:5]))
print("C[-5:] = " + str(C[-5:]))

In this 'benchmark' I also include the time for copying the arrays from host to device and from device to host. In this case the GPU function is slower than the CPU one.

For the case above (times in seconds):

CPU - 0.0033
GPU - 0.0096
Vectorize (target='cuda') - 0.15 (on my PC)

If the copying time is excluded:

GPU - 0.000245

So, what I have learned: (1) Copying from host to device and from device to host is time-consuming. This is obvious and well known. (2) I do not know the reason, but @vectorize can significantly slow down calculations on the GPU. (3) It is better to write your own kernels (and, of course, to minimize memory copying).
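
As a rough illustration of point (1), here is a back-of-envelope sketch of how much data the vector add from the question moves between host and device. It assumes that A and B are copied to the device and C is copied back. The bandwidth figure and the helper names are assumptions made up for illustration, not measurements from either machine in this thread.

```python
# Rough estimate of host<->device traffic for C = A + B on float32 arrays.
BYTES_PER_FLOAT32 = 4


def transfer_bytes(n, arrays_moved=3):
    """Bytes moved when A and B go to the device and C comes back."""
    return n * BYTES_PER_FLOAT32 * arrays_moved


def transfer_seconds(n, bandwidth_bytes_per_s, arrays_moved=3):
    """Lower bound on the copy time at a given sustained bandwidth."""
    return transfer_bytes(n, arrays_moved) / bandwidth_bytes_per_s


# ASSUMPTION: 6e9 bytes/s is a hypothetical sustained host<->device rate,
# not a measured value for the hardware discussed here.
ASSUMED_BANDWIDTH = 6e9

n_question = 32_000_000  # array length used in the question
total_bytes = transfer_bytes(n_question)
est_seconds = transfer_seconds(n_question, ASSUMED_BANDWIDTH)

print("bytes moved: %d" % total_bytes)
print("estimated copy time: %.3f s" % est_seconds)
```

Under that assumed rate, the copies alone would take about 0.064 s, which is already more than the 0.048 s the question reports for the entire CPU run. A single element-wise add does too little work per byte to pay for its own transfers.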

By the way, I have also tested @cuda.jit by solving the heat-conduction equation with an explicit finite-difference scheme, and found that in this case the execution time of the Python program is comparable to that of a C program and gives about a 100x speedup. This is because, fortunately, in this case you can do many iterations without exchanging data between host and device.
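
The effect of running many iterations per transfer can be sketched with a simple cost model built from the timings reported above. This is only an illustration under stated assumptions: the data are copied once in each direction, every step costs the same as the measured vector add, and the kernel-only figure is taken at face value (kernel launches are asynchronous, so a measurement without synchronization may understate it). The function names are hypothetical.

```python
# Timings reported in the answer above, in seconds (N = 16*50000).
CPU_PER_STEP = 0.0033        # @vectorize with target='cpu'
GPU_WITH_COPY = 0.0096       # cuda.jit kernel plus both copies
GPU_KERNEL_ONLY = 0.000245   # cuda.jit kernel, copies excluded

# Copy overhead, paid once if the data stay on the device between steps.
COPY_OVERHEAD = GPU_WITH_COPY - GPU_KERNEL_ONLY


def cpu_time(steps):
    """Modelled CPU time for `steps` repetitions of the operation."""
    return steps * CPU_PER_STEP


def gpu_time(steps):
    """Modelled GPU time: one round of copies, then `steps` kernel launches."""
    # ASSUMPTION: no host<->device traffic between consecutive steps.
    return COPY_OVERHEAD + steps * GPU_KERNEL_ONLY


def break_even_steps():
    """Smallest whole number of steps at which the GPU model is faster."""
    steps = 1
    while gpu_time(steps) >= cpu_time(steps):
        steps += 1
    return steps


for steps in (1, 10, 1000):
    ratio = cpu_time(steps) / gpu_time(steps)
    print("%5d steps: CPU %.4f s, GPU %.4f s, CPU/GPU ratio %.1f"
          % (steps, cpu_time(steps), gpu_time(steps), ratio))
print("GPU model becomes faster at", break_even_steps(), "steps")
```

With these numbers the GPU model overtakes the CPU at 4 steps, and the ratio approaches CPU_PER_STEP / GPU_KERNEL_ONLY (roughly 13) as the step count grows. The model says nothing about the 100x figure for the heat-conduction solver, which depends on that specific kernel and hardware.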

UPD. Used Software & Hardware: Win7 64bit, CPU: Intel Core2 Quad 3GHz, GPU: NVIDIA GeForce GTX 580.