Question:
Is there any way to compute a weighted correlation coefficient with pandas? I saw that R has such a method. I'd also like to get the p-value of the correlation, which I could not find in R either. Wikipedia's explanation of weighted correlation: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Weighted_correlation_coefficient
Answer 1:
I don't know of any Python packages that implement this, but it should be fairly straightforward to roll your own implementation. Using the naming conventions of the Wikipedia article:
import numpy as np

def m(x, w):
    """Weighted mean."""
    return np.sum(x * w) / np.sum(w)

def cov(x, y, w):
    """Weighted covariance."""
    return np.sum(w * (x - m(x, w)) * (y - m(y, w))) / np.sum(w)

def corr(x, y, w):
    """Weighted correlation."""
    return cov(x, y, w) / np.sqrt(cov(x, x, w) * cov(y, y, w))
I tried to make the functions above match the formulas in the Wikipedia article as closely as possible, but there are some potential simplifications and performance improvements. For example, as pointed out by @Alberto Garcia-Raboso, m(x, w) is really just np.average(x, weights=w), so there's no need to actually write a function for it.
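As a sketch of that kind of simplification, NumPy's built-in weighting support can stand in for the hand-written helpers: np.average(x, weights=w) gives the weighted mean, and np.cov with aweights=w and ddof=0 gives a weighted covariance matrix normalized by the sum of the weights, as in the Wikipedia formula. The name weighted_corr and the random test data are only illustrative, and this assumes non-negative weights.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation via np.cov (illustrative name)."""
    # With aweights and ddof=0, np.cov normalizes by sum(w),
    # which matches the weighted covariance formula on Wikipedia.
    c = np.cov(x, y, aweights=w, ddof=0)
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

# Made-up data, only to exercise the function.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)
w = rng.random(size=1000)

# np.average with weights agrees with the explicit weighted-mean formula.
assert np.isclose(np.average(x, weights=w), np.sum(x * w) / np.sum(w))

r = weighted_corr(x, y, w)
```

Because the normalization factor cancels in the ratio, the choice of ddof does not change the correlation itself; ddof=0 is used here only so the intermediate covariance matches the formula.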
The functions are pretty bare-bones, just doing the calculations. You may want to consider forcing inputs to be arrays prior to doing the calculations, i.e. x = np.asarray(x), as these functions will not work if lists are passed. Additional checks to verify all inputs have equal length, non-null values, etc. could also be implemented.
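A minimal sketch of what such input handling might look like is below. The names _prepare and safe_corr are hypothetical, and the particular checks (1-D inputs, equal length, no NaN, non-negative weights that are not all zero) are assumptions about what is reasonable to enforce, not requirements taken from the formulas.

```python
import numpy as np

def _prepare(x, y, w):
    """Convert inputs to float arrays and run basic checks (hypothetical helper)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    if x.ndim != 1 or not (x.shape == y.shape == w.shape):
        raise ValueError("x, y and w must be 1-D and of equal length")
    if np.isnan(x).any() or np.isnan(y).any() or np.isnan(w).any():
        raise ValueError("inputs must not contain NaN")
    # Assumption: weights are expected to be non-negative and not all zero.
    if (w < 0).any() or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return x, y, w

def safe_corr(x, y, w):
    """Weighted correlation that also accepts plain lists (illustrative name)."""
    x, y, w = _prepare(x, y, w)
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cxy = np.sum(w * (x - mx) * (y - my))
    cxx = np.sum(w * (x - mx) ** 2)
    cyy = np.sum(w * (y - my) ** 2)
    # The common 1 / sum(w) factor cancels between numerator and denominator.
    return cxy / np.sqrt(cxx * cyy)

# Plain Python lists now work.
r = safe_corr([1, 2, 3, 4], [2, 4, 5, 9], [1, 1, 2, 2])
```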
Example usage:
import numpy as np
import pandas as pd

# Initialize a DataFrame.
np.random.seed([3,1415])
n = 10**6
df = pd.DataFrame({
    'x': np.random.choice(3, size=n),
    'y': np.random.choice(4, size=n),
    'w': np.random.random(size=n)
})

# Compute the correlation.
r = corr(df['x'], df['y'], df['w'])
There's a discussion here regarding the p-value. It doesn't look like there's a generic calculation, and it depends on how you're actually getting the weights.
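To illustrate why the source of the weights matters, consider one special case: if the weights are integer frequency counts, the weighted correlation equals the ordinary Pearson correlation of the sample with each observation repeated w times, so a standard test (for example scipy.stats.pearsonr) could be run on the expanded data. The sketch below only checks that equivalence with made-up numbers. It is an assumption about one kind of weight, not a general p-value recipe, and the shortcut may not be justified for other kinds of weights.

```python
import numpy as np

# Made-up data; w holds integer counts: observation i occurred w[i] times.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
w = np.array([1, 3, 2, 5, 1])

# Weighted correlation, normalizing by sum(w) as in the Wikipedia formula.
c = np.cov(x, y, aweights=w, ddof=0)
r_weighted = c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

# Ordinary correlation of the expanded sample (each row repeated w[i] times).
x_rep, y_rep = np.repeat(x, w), np.repeat(y, w)
r_expanded = np.corrcoef(x_rep, y_rep)[0, 1]

assert np.isclose(r_weighted, r_expanded)
```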
Answer 2:
The statsmodels package has an implementation of weighted correlation.
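The answer does not name a specific function. One class that appears to fit is DescrStatsW in statsmodels.stats.weightstats, whose corrcoef attribute returns a weighted correlation matrix. The following is a sketch under that assumption, using made-up data:

```python
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

# Made-up data, only to exercise the call.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)
w = rng.random(size=500)

# DescrStatsW expects observations in rows and variables in columns.
stats = DescrStatsW(np.column_stack([x, y]), weights=w)

# corrcoef is a 2x2 matrix; the off-diagonal entry is the x-y correlation.
r = stats.corrcoef[0, 1]
```

The result should agree with the hand-rolled corr(x, y, w) from Answer 1 up to floating-point error.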