Efficient columnwise correlation coefficient calculation


Question:

Original question

I am correlating row P of size n to every column of matrix O of size n×m. I crafted the following code:

import numpy as np

def ColumnWiseCorrcoef(O, P):
    n = P.size
    DO = O - (np.sum(O, 0) / np.double(n))   # demean each column of O
    DP = P - (np.sum(P) / np.double(n))      # demean P
    return np.dot(DP, DO) / np.sqrt(np.sum(DO ** 2, 0) * np.sum(DP ** 2))
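For reference (the notation below is mine, not from the original post), this computes the Pearson correlation of P with column j of O:

r_j = \frac{\sum_{i=1}^{n} (P_i - \bar{P})(O_{ij} - \bar{O}_j)}{\sqrt{\sum_{i=1}^{n} (O_{ij} - \bar{O}_j)^2 \, \sum_{i=1}^{n} (P_i - \bar{P})^2}}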

It is more efficient than the naive approach:

def ColumnWiseCorrcoefNaive(O, P):
    # row 0 of the full (m+1)x(m+1) correlation matrix, minus the P-vs-P entry
    return np.corrcoef(P, O.T)[0, 1:O[0].size + 1]

Here are the timings I get with numpy-1.7.1-MKL on an Intel core:

O = np.reshape(np.random.rand(100000), (1000,100))
P = np.random.rand(1000)

%timeit -n 1000 A = ColumnWiseCorrcoef(O, P)
1000 loops, best of 3: 787 us per loop
%timeit -n 1000 B = ColumnWiseCorrcoefNaive(O, P)
1000 loops, best of 3: 2.96 ms per loop

Now the question: can you suggest an even faster version of the code for this problem? Squeezing out an additional 20% would be great.

UPDATE May 2017

After quite some time I got back to this problem, re-ran and extended the task and the tests.

  1. Using einsum, I have extended the code to the case where P is not a single row but a matrix, so the task becomes correlating all columns of O with all columns of P (see the sketch below).

  2. Being curious about how the same problem can be solved in other languages commonly used for scientific computing, I implemented it (with help from other people) in MATLAB, Julia, and R. MATLAB and Julia are the fastest, and each has a dedicated routine for columnwise correlation. R also has a dedicated routine but is the slowest.

  3. In the current version of numpy (1.12.1 from Anaconda), einsum still wins over the dedicated functions I used.

All the scripts and timings are available at https://github.com/ikizhvatov/efficient-columnwise-correlation.
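As a minimal sketch of the matrix-P extension from item 1 (my own illustration using einsum, not the code from the linked repository; the name ColumnWiseCorrcoef2D is made up), one way to correlate all columns of O with all columns of P is:

import numpy as np

def ColumnWiseCorrcoef2D(O, P):
    # Correlate every column of O (n x mO) with every column of P (n x mP),
    # returning an mO x mP matrix of Pearson correlation coefficients.
    # Illustrative sketch only, not the repository's implementation.
    n = O.shape[0]
    DO = O - np.einsum('ij->j', O) / np.double(n)   # demean each column of O
    DP = P - np.einsum('ij->j', P) / np.double(n)   # demean each column of P
    cov = np.dot(DO.T, DP)                          # cross products, shape (mO, mP)
    varO = np.einsum('ij,ij->j', DO, DO)            # per-column sums of squares of O
    varP = np.einsum('ij,ij->j', DP, DP)            # per-column sums of squares of P
    return cov / np.sqrt(np.outer(varO, varP))

As a quick consistency check, column k of ColumnWiseCorrcoef2D(O, P) should match ColumnWiseCorrcoef(O, P[:, k]) from the code above.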

Answer 1:

We can introduce np.einsum here; however, your mileage may vary depending on how your numpy was compiled and whether it makes use of SSE2. The extra einsum calls replacing plain summations might seem extraneous, but numpy ufuncs did not make use of SSE2 until numpy 1.8, while einsum already does, and it lets us avoid a few if statements.

Running this on an Opteron core with Intel MKL BLAS, I get an odd result, as I would expect the dot call to take the majority of the time.

def newColumnWiseCorrcoef(O, P):
    n = P.size
    DO = O - (np.einsum('ij->j', O) / np.double(n))   # demean each column of O
    DP = P - (np.einsum('i->', P) / np.double(n))     # demean P without modifying the caller's array in place
    tmp = np.einsum('ij,ij->j', DO, DO)
    tmp *= np.einsum('i,i->', DP, DP)                 # dot or vdot doesn't really change much
    return np.dot(DP, DO) / np.sqrt(tmp)

O = np.reshape(np.random.rand(100000), (1000,100))
P = np.random.rand(1000)

old = ColumnWiseCorrcoef(O,P)
new = newColumnWiseCorrcoef(O,P)

np.allclose(old,new)
True

%timeit ColumnWiseCorrcoef(O,P)
100 loops, best of 3: 1.52 ms per loop

%timeit newColumnWiseCorrcoef(O,P)
1000 loops, best of 3: 518 µs per loop

Running this on an Intel system with Intel MKL, I again get something more reasonable/expected:

%timeit ColumnWiseCorrcoef(O,P)
1000 loops, best of 3: 524 µs per loop

%timeit newColumnWiseCorrcoef(O,P)
1000 loops, best of 3: 354 µs per loop

Again on the Intel machine with something a bit bigger:

O = np.random.rand(100000, 1000)
P = np.random.rand(100000)

%timeit ColumnWiseCorrcoef(O,P)
1 loops, best of 3: 1.33 s per loop

%timeit newColumnWiseCorrcoef(O,P)
1 loops, best of 3: 791 ms per loop