Finding a correlation between variable and class v

2019-06-05 05:28发布

问题:

I have a dataset which contains 7 numerical attributes and one nominal which is the class variable. I was wondering how I can the best attribute that can be used to predict the class attribute. Would finding the largest information gain by each attribute be the solution?

回答1:

So the problem you are asking about falls under the domain of feature selection, and more broadly, feature engineering. There is a lot of literature online regarding this, and there are definitely a lot of blogs/tutorials/resources online for how to do this.

To give you a good link I just read through, here is a blog with a tutorial on some ways to do feature selection in Weka, and the same blog's general introduction on feature selection. Naturally there are a lot of different approaches, as knb's answer pointed out.

To give a short description though, there are a few ways to go about it: you can assign a score to each of your features (like information gain, etc) and filter out features with 'bad' scores; you can treat finding the best parameters as a search problem, where you take different subsets of the features and assess the accuracy in turn; and you can use embedded methods, which kind of learn which features contribute most to the accuracy as the model is being built. Examples of embedded methods are regularization algorithms like LASSO and ridge regression.



回答2:

Do you just want that attribute's name, or do you also want a quantifiable metric (like a t-value) for this "best" attribute?

For a qualitative approach, you can generate a classification tree with just one split, two leaves.

For example, weka's "diabetes.arff" sample-dataset (n = 768), which has a similar structure as your dataset (all attribs numeric, but the class attribute has only two distinct categorical outcomes), I can set the minNumObj parameter to, say, 200. This means: create a tree with minimum 200 instances in each leaf.

java -cp $WEKA_JAR/weka.jar  weka.classifiers.trees.J48 -C 0.25 -M 200 -t data/diabetes.arff 

Output:

J48 pruned tree
------------------

plas <= 127: tested_negative (485.0/94.0)
plas > 127: tested_positive (283.0/109.0)

Number of Leaves  :     2

Size of the tree :  3


Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.04 seconds

=== Error on training data ===

Correctly Classified Instances         565               73.5677 %

This creates a tree with one split on the "plas" attribute. For interpretation, this makes sense, because indeed, patients with diabetes have an elevated concentration of glucose in their blood plasma. So "plas" is the most important attribute, as it was chosen for the first split. But this does not tell you how important.

For a more quantitative approach, maybe you can use (Multinomial) Logistic Regression. I'm not so familiar with this, but anyway:

In the Exlorer GUI Tool, choose "Classify" > Functions > Logistic.

Run the model. The odds ratio and the coefficients might contain what you need in a quantifiable manner. Lower odds-ratio (but > 0.5) is better/more significant, but I'm not sure. Maybe read on here, this answer by someone else.

    java -cp $WEKA_JAR/weka.jar  weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -t data/diabetes.arff 

Here's the command line output

Options: -R 1.0E-8 -M -1 

Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                       Class
Variable     tested_negative
============================
preg                 -0.1232
plas                 -0.0352
pres                  0.0133
skin                 -0.0006
insu                  0.0012
mass                 -0.0897
pedi                 -0.9452
age                  -0.0149
Intercept             8.4047


Odds Ratios...
                       Class
Variable     tested_negative
============================
preg                  0.8841
plas                  0.9654
pres                  1.0134
skin                  0.9994
insu                  1.0012
mass                  0.9142
pedi                  0.3886
age                   0.9852

=== Error on training data ===

Correctly Classified Instances         601               78.2552 %
Incorrectly Classified Instances       167               21.7448 %