I would like to build a distance matrix using Pearson correlation distance.
I first tried the scipy.spatial.distance.pdist(df,'correlation')
which is very fast for my 5000 rows * 20 features dataset.
Since I want to build a recommender, I wanted to slightly change the distance, only considering features which are distinct for NaN for both users. Indeed, scipy.spatial.distance.pdist(df,'correlation')
output NaN when it meets any feature whose value is float('nan').
Here is my code, df being my 5000*20 pandas DataFrame
dist_mat = []
d = df.shape[1]
for i,row_i in enumerate(df.itertuples()):
for j,row_j in enumerate(df.itertuples()):
if i<j:
print(i,j)
ind = [False if (math.isnan(row_i[t+1]) or math.isnan(row_j[t+1])) else True for t in range(d)]
dist_mat.append(scipy.spatial.distance.correlation([row_i[t] for t in ind],[row_j[t] for t in ind]))
This code works but it is ashtoningly slow compared to the scipy.spatial.distance.pdist(df,'correlation')
one. My question is: how can I improve my code so it runs a lot faster? Or where can I find a library which calculates correlation between two vectors which only take in consideration features which appears in both of them?
Thank you for your answers.
I think you need to do this with Cython, here is an example:
To check the output without NAN:
To check the output with NAN: