I have started to use scikit-learn for text extraction. When I use the standard CountVectorizer and TfidfTransformer in a pipeline, and I try to combine them with new features (a concatenation of matrices), I get a row dimension problem.
This is my pipeline:
pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('ngram_tfidf', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer())])),
        ('addned', AddNed())])),
    ('clf', SGDClassifier())])
This is my class AddNed, which adds 30 new features to each document (sample).
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddNed(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, **transform_params):
        # do_something builds list_feat (the 30 new features per document)
        do_something
        x_new_feat = np.array(list_feat)
        print(type(X))
        X_np = np.array(X)
        print(X_np.shape, x_new_feat.shape)
        return np.concatenate((X_np, x_new_feat), axis=1)

    def fit(self, X, y=None):
        return self
And here is the first part of my main program:
data = load_files('HO_without_tag')
X, Y = data.data, data.target
# parameters (the param grid) is defined elsewhere
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=20)
print(len(data.data), len(data.target))
grid_search.fit(X, Y).transform(X)
But I get this result:
486 486
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV]feats__ngram_tfidf__vect__max_features=3000....
323
<class 'list'>
(323,) (486, 30)
And of course an IndexError exception:
return np.concatenate((X_np, x_new_feat), axis = 1)
IndexError: axis 1 out of bounds [0, 1)
When I receive the param X in the transform function (class AddNed), why don't I get a numpy array of shape (486, 3000) for X? I only get shape (323,). I don't understand, because if I remove the FeatureUnion and the AddNed() step, CountVectorizer and tf-idf work properly, with the right features and the right shape. Does anyone have an idea? Thanks a lot.
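To make the shapes concrete, here is a minimal sketch with toy documents (ShapeProbe is a made-up stand-in for AddNed, not my real code). It reproduces what I observe: each FeatureUnion branch receives the same raw X, and the branch outputs are stacked side by side afterwards.

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class ShapeProbe(BaseEstimator, TransformerMixin):
    # stand-in for AddNed that only reports what it receives
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        print(type(X), len(X))  # the raw list of documents, not a matrix
        return np.zeros((len(X), 1))

docs = ["one document", "another document", "a third document"]
union = FeatureUnion([
    ('ngram_tfidf', Pipeline([('vect', CountVectorizer()),
                              ('tfidf', TfidfTransformer())])),
    ('probe', ShapeProbe())])
print(union.fit_transform(docs).shape)  # branch outputs hstacked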
OK, I will try to give more explanation. When I say do_something, I actually do_nothing with X. In the class AddNed, if I rewrite:
In this transform case above, I do not concatenate the X matrix and the new matrix. I presume FeatureUnion does that... And my result:
To go further, just to see, if I put cross-validation on GridSearchCV, just to modify the sample size:
I have this result:
Of course, if necessary, I can give all the code of do_nothing_withX. But what I don't understand is why the sample size in the CountVectorizer + tf-idf pipeline is not equal to the number of files loaded with the sklearn.datasets.load_files() function.
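One hint from the log: GridSearchCV reports "Fitting 3 folds", and 3-fold cross-validation fits each candidate on only ~2/3 of the data, so transform() never sees all 486 documents at once. A sketch of the split sizes under that assumption (the label balance below is made up):

from sklearn.model_selection import StratifiedKFold
import numpy as np

y = np.repeat([0, 1], [243, 243])  # assumed: 486 labels, invented class balance
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(np.zeros(486), y):
    print(len(train_idx), len(test_idx))  # ~324 train / ~162 test per fold

324 (or 323, depending on the real class distribution) is exactly the training-fold size for 486 samples, which would explain the (323,) shape.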
You've probably solved it by now, but someone else may have the same problem:
AddNed tries to concatenate a dense matrix with a sparse matrix; the sparse matrix should be converted to a dense matrix first. I've found the same error trying to use the result of CountVectorizer.
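A minimal sketch of that fix (illustrative names, under the assumption that X arrives as a scipy sparse matrix): densify before np.concatenate, or keep everything sparse with scipy.sparse.hstack.

import numpy as np
from scipy import sparse

def add_features(X, x_new_feat):
    # np.concatenate cannot mix a sparse matrix with a dense array,
    # so convert the sparse matrix to dense first
    if sparse.issparse(X):
        X = X.toarray()
    return np.concatenate((np.asarray(X), x_new_feat), axis=1)

# Alternative that avoids densifying a large matrix:
# sparse.hstack([X, x_new_feat]).tocsr()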