numba guvectorize target='parallel' slower

I've been attempting to optimize a piece of python code that involves large multi-dimensional array calculations. I am getting counterintuitive results with numba. I am running on an MBP, mid 2015, 2.5 GHz i7 quadcore, OS 10.10.5, python 2.7.11. Consider the following:

 import numpy as np
 from numba import jit, vectorize, guvectorize
 import numexpr as ne
 import timeit

 def add_two_2ds_naive(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @jit
 def add_two_2ds_jit(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
    '(n,m),(n,m)->(n,m)',target='cpu')
 def add_two_2ds_cpu(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
    '(n,m),(n,m)->(n,m)',target='parallel')
 def add_two_2ds_parallel(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 def add_two_2ds_numexpr(A,B,res):
     res = ne.evaluate('A+B')

 if __name__=="__main__":
     np.random.seed(69)
     A = np.random.rand(10000,100)
     B = np.random.rand(10000,100)
     res = np.zeros((10000,100))

I can now run timeit on the various functions:

%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.16 ms per loop

%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.19 ms per loop

%timeit add_two_2ds_parallel(A,B,res)
100 loops, best of 3: 6.9 ms per loop

%timeit add_two_2ds_numexpr(A,B,res)
1000 loops, best of 3: 1.62 ms per loop

It seems that 'parallel' is not taking even using the majority of a single core, as it's usage in top shows that python is hitting ~40% cpu for 'parallel', ~100% for 'cpu', and numexpr hits ~300%.

There are two issues with your @guvectorize implementations. The first is that you are are doing all the looping inside your @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize on the broadcast dimensions in a ufunc/gufunc. Since the signature of your gufunc is 2D, and your inputs are 2D, there is only a single call to the inner function, which explains the only 100% CPU usage you saw.

The best way to write the function you have above is to use a regular ufunc:

@vectorize('(float64, float64)', target='parallel')
def add_ufunc(a, b):
    return a + b

Then on my system, I see these speeds:

%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop

%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop

%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.43 ms per loop

%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop

%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 2.03 ms per loop

(This is a very similar OS X system to yours, but with OS X 10.11.)

Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect that the bottleneck is due to memory (or cache) bandwidth, but I haven't done the measurements to check that.

Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).