I have a few categorical features:
['Gender',
'Married',
'Dependents',
'Education',
'Self_Employed',
'Property_Area']
import pandas as pd
from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(pd.crosstab(df.Gender, df.Married))
print(f'Chi-square Statistic: {chi2}, p-value: {p}')
output:
Chi-square Statistic: 79.63562874824729, p-value: 4.502328957824834e-19
How can I know if the features are independent from each other from these statistics?
I am trying to build a classification model, so I just wanted to know whether these categorical columns are useful for predicting my target variable.
Contingency tables are used in statistics to summarize the relationship between two or more categorical variables.
In your example, the contingency table between the two variables Gender and Married is a frequency table of these variables presented simultaneously.
A chi-squared test conducted on a contingency table can test whether or not a relationship exists between the variables, the relationship being between the row variable and the column variable.
scipy.stats.chi2_contingency computes, by default, Pearson's chi-squared statistic.
Moreover, we are interested in the two-tailed significance, which is the p-value in your example.
The p-value is the evidence against a null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.
And the null hypothesis in your case is the independence of the observed frequencies in the contingency table, i.e. that Gender and Married are unrelated.
Choosing a significance level (alpha) of 5%: your p-value of 4.502328957824834e-19 is much less than .05, so you reject the null hypothesis, indicating that the rows and columns of the contingency table are dependent. Generally this means that it is worthwhile to interpret the cells in the contingency table.
In this particular case it means that being Male or Female (i.e. Gender) is not distributed similarly across the different levels of Marital Status (i.e. Married, Not-Married).
So, being married may be more common for one gender than the other!
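The decision rule above can be sketched for any pair of categorical columns. This is a minimal, self-contained example with toy data (the DataFrame and its values are assumptions standing in for your data):

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for your DataFrame (values are made up)
df = pd.DataFrame({
    "Gender":  ["Male", "Male", "Female", "Female", "Male", "Female"] * 20,
    "Married": ["Yes", "Yes", "No", "Yes", "No", "No"] * 20,
})

alpha = 0.05  # chosen significance level
for a, b in combinations(["Gender", "Married"], 2):
    # Build the contingency table and run Pearson's chi-squared test
    chi2, p, dof, expected = chi2_contingency(pd.crosstab(df[a], df[b]))
    verdict = ("dependent (reject H0)" if p < alpha
               else "independent (fail to reject H0)")
    print(f"{a} vs {b}: chi2={chi2:.2f}, p={p:.4g} -> {verdict}")
```

With more categorical columns, the same loop covers every pair, which is a quick way to spot redundant features before modeling.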
Update
According to your comment, I see you have some doubts about this test.
This test basically tells you if the relationship between the variables is significant (i.e. likely holds in the population) or arose by chance.
So if the result is significant (a low p-value), that means there's a significant dependency between the variables!
Now, if Gender and Married are both features in your model, that may lead to over-fitting and feature redundancy, so you may want to keep only one of them.
But if Gender or Married is the dependent variable (like y), then it's good that they have a significant relationship.
Extra bonus:
Sometimes one of the features temporarily becomes a dependent variable during data imputation (when you have missing values).
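As a hedged sketch of that idea (toy data and column names are assumptions): a missing value in one categorical column can be filled by treating that column as a temporary target predicted from a related column, here using the most frequent value within each group:

```python
import pandas as pd

# Toy data with one missing Gender value (values are made up)
df = pd.DataFrame({
    "Married": ["Yes", "Yes", "No", "No", "Yes"],
    "Gender":  ["Male", "Male", "Female", "Female", None],
})

# Treat Gender as a temporary dependent variable: for each Married group,
# find the most frequent Gender among the non-missing rows
mode_by_married = (df.dropna()
                     .groupby("Married")["Gender"]
                     .agg(lambda s: s.mode()[0]))

# Fill the missing Gender values using the group they belong to
mask = df["Gender"].isna()
df.loc[mask, "Gender"] = df.loc[mask, "Married"].map(mode_by_married)
print(df)
```

A proper classifier could replace the per-group mode here, but the idea is the same: the feature with missing values plays the role of y during imputation, which is exactly when its dependence on the other features is useful.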