I have some categorical features in my data along with continuous ones. Is it a good idea or an absolutely bad idea to one-hot encode the categorical features in order to find their correlation with the labels, along with the other continuous features?
There is a way to calculate the correlation coefficient without one-hot encoding the categorical variables. Cramér's V statistic is one method for measuring the association between categorical variables. The following link shows how it can be calculated: Using pandas, calculate Cramér's coefficient matrix. For the continuous variables, you can first convert them into categories by using cut from pandas.
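As a minimal sketch of the approach described above (not the code from the linked article), Cramér's V can be computed from a contingency table built with `pd.crosstab`, and a continuous column can be binned with `pd.cut` first. The helper name `cramers_v`, the toy DataFrame, and the choice of 3 equal-width bins are assumptions made for illustration; no bias correction is applied.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (hypothetical helper, uncorrected)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    # Drop empty rows/columns (e.g. unobserved bins) to avoid dividing by zero.
    observed = observed[observed.sum(axis=1) > 0][:, observed.sum(axis=0) > 0]
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Toy data, invented for illustration only.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "green", "green"],
    "price": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "label": ["a", "a", "b", "b", "c", "c"],
})

# Bin the continuous column into 3 equal-width categories.
df["price_bin"] = pd.cut(df["price"], bins=3)

v_color = cramers_v(df["color"], df["label"])      # categorical vs. label
v_price = cramers_v(df["price_bin"], df["label"])  # binned continuous vs. label
print(v_color, v_price)
```

Cramér's V ranges from 0 (no association) to 1 (one variable fully determines the other). The result for a binned continuous feature depends on how the bins are chosen, so it is worth trying more than one binning.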
I was looking to do the same thing in BigQuery. For numeric features you can use the built-in CORR(x, y) function. For categorical features, you can calculate it as: cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2)), where cardinality(cat1 x cat2) is the number of distinct (cat1, cat2) pairs that occur in the data. This translates to the following SQL:
A higher number means lower correlation; the minimum value, 1, occurs when the higher-cardinality feature fully determines the other.
I used the following Python script to generate the SQL:
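A minimal sketch of such a generator, not the author's script: it emits one ratio per pair of categorical columns using BigQuery's `COUNT(DISTINCT ...)`, `CONCAT`, `CAST` and `GREATEST`. The function name `build_cardinality_sql`, the table name `my_dataset.my_table`, the column names, and the `'|'` separator are all hypothetical. Rows where either column is NULL drop out of the pair count, because `CONCAT` returns NULL in that case.

```python
from itertools import combinations
from typing import Sequence


def build_cardinality_sql(table: str, columns: Sequence[str]) -> str:
    """Build one SELECT with a cardinality ratio for every pair of columns."""
    selects = []
    for a, b in combinations(columns, 2):
        # Distinct (a, b) pairs divided by the larger single-column cardinality.
        selects.append(
            f"  COUNT(DISTINCT CONCAT(CAST({a} AS STRING), '|', CAST({b} AS STRING)))\n"
            f"    / GREATEST(COUNT(DISTINCT {a}), COUNT(DISTINCT {b})) AS {a}_vs_{b}"
        )
    return "SELECT\n" + ",\n".join(selects) + f"\nFROM `{table}`"


# Hypothetical table and columns, for illustration only.
sql = build_cardinality_sql("my_dataset.my_table", ["city", "segment", "plan"])
print(sql)
```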
It should be straightforward to do the same thing in numpy.
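For instance, the same ratio can be sketched in a few lines. This version uses pandas rather than plain numpy, since `drop_duplicates` and `nunique` map directly onto the two cardinalities. The helper name `cardinality_ratio` and the toy data are assumptions for illustration, and rows with a missing value in either column are dropped before counting.

```python
import pandas as pd


def cardinality_ratio(df: pd.DataFrame, a: str, b: str) -> float:
    """Distinct (a, b) pairs divided by the larger single-column cardinality."""
    pairs = df[[a, b]].dropna()
    n_pairs = len(pairs.drop_duplicates())
    return n_pairs / max(pairs[a].nunique(), pairs[b].nunique())


# Toy data, invented for illustration only.
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Rome", "Milan", "Paris", "Rome"],
    "country": ["FR", "FR", "IT", "IT", "FR", "IT"],
    "plan": ["free", "paid", "free", "paid", "paid", "paid"],
})

city_country = cardinality_ratio(df, "city", "country")  # city determines country
country_plan = cardinality_ratio(df, "country", "plan")  # every combination occurs
print(city_country, country_plan)
```

Here `city_country` comes out at the minimum of 1.0 because each city maps to exactly one country, while `country_plan` is larger because every country appears with every plan.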