I am trying to select the best features using chi-square (scikit-learn 0.10). From a total of 80 training documents I first extract 227 feature, and from these 227 features I want to select the top 10 ones.
my_vectorizer = CountVectorizer(analyzer=MyAnalyzer())
X_train = my_vectorizer.fit_transform(train_data)
X_test = my_vectorizer.transform(test_data)
Y_train = np.array(train_labels)
Y_test = np.array(test_labels)
X_train = np.clip(X_train.toarray(), 0, 1)
X_test = np.clip(X_test.toarray(), 0, 1)
ch2 = SelectKBest(chi2, k=10)
print X_train.shape
X_train = ch2.fit_transform(X_train, Y_train)
print X_train.shape
The results are the following.
(80, 227)
(80, 14)
They are similar if I set k
equal to 100
.
(80, 227)
(80, 227)
Why does this happen?
*EDIT: A full output example , now without clipping, where I request 30 and got 32 instead:
Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
[0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
[0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 32)
Using 32(requested:30) best features from 9 training documents
get support:
[ True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True]
get support with vocabulary :
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
scale_C)
Classifying...
Another example without clipping, where I request 10 and get 11 instead:
Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
[0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
[0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 11)
Using 11(requested:10) best features from 9 training documents
get support:
[ True True True False False True False False False False True False
False False True False False False True False True False True True
False False False False True False False False]
get support with vocabulary :
[ 0 1 2 5 10 14 18 20 22 23 28]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
scale_C)
Classifying...