Calculate special correlation distance matrix fast

I would like to build a distance matrix using Pearson correlation distance. I first tried the scipy.spatial.distance.pdist(df,'correlation') which is very fast for my 5000 rows * 20 features dataset.

Since I want to build a recommender, I wanted to slightly change the distance, only considering features which are distinct for NaN for both users. Indeed, scipy.spatial.distance.pdist(df,'correlation') output NaN when it meets any feature whose value is float('nan').

Here is my code, df being my 5000*20 pandas DataFrame

dist_mat = []
d = df.shape[1]
for i,row_i in enumerate(df.itertuples()):
    for j,row_j in enumerate(df.itertuples()):
        if i<j:
            print(i,j)
            ind = [False if (math.isnan(row_i[t+1]) or math.isnan(row_j[t+1])) else True for t in range(d)]
            dist_mat.append(scipy.spatial.distance.correlation([row_i[t] for t in ind],[row_j[t] for t in ind]))

This code works but it is ashtoningly slow compared to the scipy.spatial.distance.pdist(df,'correlation') one. My question is: how can I improve my code so it runs a lot faster? Or where can I find a library which calculates correlation between two vectors which only take in consideration features which appears in both of them?

Thank you for your answers.

标签： python matrix pandas distance correlation

1条回答

可以哭但决不认输i

2楼-- · 2019-04-13 11:47

I think you need to do this with Cython, here is an example:

#cython: boundscheck=False, wraparound=False, cdivision=True

import numpy as np

cdef extern from "math.h":
    bint isnan(double x)
    double sqrt(double x)

def pair_correlation(double[:, ::1] x):
    cdef double[:, ::] res = np.empty((x.shape[0], x.shape[0]))
    cdef double u, v
    cdef int i, j, k, count
    cdef double du, dv, d, n, r
    cdef double sum_u, sum_v, sum_u2, sum_v2, sum_uv

    for i in range(x.shape[0]):
        for j in range(i, x.shape[0]):
            sum_u = sum_v = sum_u2 = sum_v2 = sum_uv = 0.0
            count = 0            
            for k in range(x.shape[1]):
                u = x[i, k]
                v = x[j, k]
                if u == u and v == v:
                    sum_u += u
                    sum_v += v
                    sum_u2 += u*u
                    sum_v2 += v*v
                    sum_uv += u*v
                    count += 1
            if count == 0:
                res[i, j] = res[j, i] = -9999
                continue

            um = sum_u / count
            vm = sum_v / count
            n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
            du = sqrt(sum_u2 - 2 * sum_u * um + um * um * count) 
            dv = sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
            r = 1 - n / (du * dv)
            res[i, j] = res[j, i] = r
    return res.base

To check the output without NAN:

import numpy as np
from scipy.spatial.distance import pdist, squareform, correlation
x = np.random.rand(2000, 20)
np.allclose(pair_correlation(x), squareform(pdist(x, "correlation")))

To check the output with NAN:

x = np.random.rand(2000, 20)
x[x < 0.3] = np.nan
r = pair_correlation(x)

i, j = 200, 60 # change this
mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
u = x[i, mask]
v = x[j, mask]
assert abs(correlation(u, v) - r[i, j]) < 1e-12

0人赞添加讨论(0) 举报

Calculate special correlation distance matrix fast

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间