For a machine learning task I am looking for a way to merge two feature matrices with different dimensions, so that I can feed them both to an estimator. I cannot use the scipy merging methods, since these require compatible shapes. I can use the numpy merging methods, but the result goes wrong when I actually try to split the array for cross validation. The error looks like this:
Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\linearSVM.py", line 50, in <module>
    result = ridge(train_text,train_labels,test_set,train_state,test_state)
  File "C:\Users\Ano\workspace\final_submission\src\Algorithms.py", line 90, in ridge
    x_train, x_test, y_train, y_test = cross_validation.train_test_split(train, labels, test_size = 0.2, random_state = 42)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1394, in train_test_split
    arrays = check_arrays(*arrays, **options)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 211, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 77946. Expected 2
I found the reason for this error in another Stack Overflow thread: Concatenate sparse matrices in Python using SciPy/Numpy. Apparently np.vstack/np.hstack do not concatenate the sparse matrices' contents; they wrap the two matrix objects in an object array, which caused my error.
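A minimal sketch of what happens there, with small stand-in matrices:

import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.ones((3, 4)))
b = sparse.csr_matrix(np.ones((3, 2)))

# numpy does not know about sparse matrices, so instead of a (3, 6)
# result it builds a 1-D object array holding the two matrix objects
stacked = np.hstack((a, b))
print(stacked.shape, stacked.dtype)  # -> (2,) object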
The shapes I am dealing with:
(77946, 63677)
(77946, 55)
Basically, I am looking for a way to append those 55 extra features per sample from the second matrix to the features in the first matrix.
I also tried to create a numpy array with the appropriate dimensions and simply fill it with the two feature matrices, but even creating that array gave me a memory error, which makes sense in hindsight: a dense 77946 x 63732 float64 array takes roughly 40 GB. I tried to convert it to a sparse matrix, but that didn't work either. Perhaps I am doing something wrong there?
new_matrix = sparse.csr_matrix(np.zeros((77946, 63732)))  # 63677 + 55 columns
new_matrix[:, :63677] = big_feature_matrix    # features from the first matrix
new_matrix[:, 63677:] = small_feature_matrix  # the 55 extra features
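If slice assignment is the goal, a format that actually supports it, such as lil_matrix, would avoid the dense allocation entirely. A minimal sketch, assuming big_feature_matrix and small_feature_matrix are the two matrices with the shapes above (older SciPy versions may require a dense right-hand side for the assignment):

from scipy import sparse

# build the combined matrix without ever allocating a dense array
new_matrix = sparse.lil_matrix((77946, 63732))
new_matrix[:, :63677] = big_feature_matrix
new_matrix[:, 63677:] = small_feature_matrix
new_matrix = new_matrix.tocsr()  # CSR slices rows efficiently for cross validation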
Update
So I tried out Jaime's solution, but it gave me an error:
Code involved
def feature_extraction(train, test, train_small, test_small):
    vectorizer = TfidfVectorizer(min_df=3, strip_accents="unicode", ngram_range=(1, 2))
    cv = CountVectorizer(strip_accents="unicode", analyzer="word", token_pattern=r'\w{1,}')
    print("fitting Vectorizer")
    vectorizer.fit(train)
    # vectorize the small feature sets with the CountVectorizer
    train_small = cv.fit_transform(train_small)
    test_small = cv.transform(test_small)
    print("transforming text")
    train = vectorizer.transform(train)
    test = vectorizer.transform(test)
    # horizontally stack the two sparse matrices into single CSR matrices
    new_train = sparse.hstack((train, train_small), format='csr')
    new_test = sparse.hstack((test, test_small), format='csr')
    return new_train, new_test
Full traceback
Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\linearSVM.py", line 50, in <module>
    result = ridge(train_text,train_labels,test_set,train_small,test_small)
  File "C:\Users\Ano\workspace\final_submission\src\Algorithms.py", line 89, in ridge
    train,test = feature_extraction(train,test,train_small,test_small)
  File "C:\Users\Ano\workspace\final_submission\src\Preprocessing.py", line 109, in feature_extraction
    format='csr')
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 423, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 523, in bmat
    raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions
The train sets have the same dimensions as before. The test sets have fewer samples (42157).
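For reference, this is exactly the error scipy.sparse.hstack raises when the stacked blocks do not have the same number of rows, which can be reproduced with made-up shapes:

from scipy import sparse

a = sparse.rand(5, 3, format='csr')
b = sparse.rand(4, 2, format='csr')  # 4 rows instead of 5
sparse.hstack((a, b), format='csr')
# ValueError: blocks[0,:] has incompatible row dimensions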
Update
Jaime's solution did actually work; I just messed up when I loaded the files. Thank you for all your help!
You can use scipy.sparse.hstack:
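A minimal, self-contained sketch with random matrices standing in for the real feature matrices:

from scipy import sparse

# stand-ins for the real feature matrices: same number of rows,
# different numbers of columns
big = sparse.rand(77946, 63677, density=0.001, format='csr')
small = sparse.rand(77946, 55, density=0.1, format='csr')

# stack the blocks side by side; format='csr' makes the result a CSR
# matrix, which supports the row indexing that train_test_split needs
combined = sparse.hstack((big, small), format='csr')
print(combined.shape)  # (77946, 63732)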