Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good or absolutely bad idea to hot encode category features to find correlation of it to labels along with other continuous creatures?

标签： pandas machine-learning categorical-data feature-engineering

2条回答

一夜七次

2楼-- · 2020-05-20 00:29

There is a way to calculate the correlation coefficient without one-hot encoding the category variable. Cramers V statistic is one method for calculating the correlation of categorical variables. It can be calculated as follows. The following link is helpful. Using pandas, calculate Cramér's coefficient matrix For variables with other continuous values, you can categorize by using cut of pandas.

import pandas as pd
import numpy as np
import scipy.stats as ss
import seaborn as sns

tips = sns.load_dataset("tips")

tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)

def cramers_v(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

confusion_matrix = pd.crosstab(tips["day"], tips["time"]).as_matrix()
cramers_v(confusion_matrix)
# Out[10]: 0.93866193407222209

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"]).as_matrix()
cramers_v(confusion_matrix)
# Out[24]: 0.16498707494988371

0人赞添加讨论(0) 举报

Summer. ? 凉城

3楼-- · 2020-05-20 00:41

I was looking to do same thing in BigQuery. For numeric features you can use built in CORR(x,y) function. For categorical features, you can calculate it as: cardinality (cat1 x cat2) / max (cardinality(cat1), cardinality(cat2). Which translates to following SQL:

SELECT 
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...

Higher number means lower correlation.

I used following python script to generate SQL:

import itertools

arr = range(1,10)

query = ',\n'.join(list('COUNT(DISTINCT(CONCAT({a}, {b}))) / GREATEST (COUNT(DISTINCT({a})), COUNT(DISTINCT({b}))) as cat{a}_{b}'.format(a=a,b=b) 
  for (a,b) in itertools.combinations(arr,2)))
query = 'SELECT \n ' + query + '\n FROM  `...`;'
print (query)

It should be straightforward to do same thing in numpy.

0人赞添加讨论(0) 举报

Categorical features correlation

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间