I'm new to scikit-learn. I'm trying to use preprocessing.OneHotEncoder to encode my training and test data. After encoding, I tried to train a random forest classifier on that data, but I get the following error when fitting. (Here is the error trace.)
99 model.fit(X_train, y_train)
100 preds = model.predict_proba(X_cv)[:, 1]
101
C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
288
289 # Precompute some data
--> 290 X, y = check_arrays(X, y, sparse_format="dense")
291 if (getattr(X, "dtype", None) != DTYPE or
292 X.ndim != 2 or
C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in check_arrays(*arrays, **options)
200 array = array.tocsc()
201 elif sparse_format == 'dense':
--> 202 raise TypeError('A sparse matrix was passed, but dense '
203 'data is required. Use X.toarray() to '
204 'convert to a dense numpy array.')
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
I tried to convert the sparse matrix to dense using X.toarray() and X.todense(), but when I do that, I get the following error trace.
99 model.fit(X_train.toarray(), y_train)
100 preds = model.predict_proba(X_cv)[:, 1]
101
C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self)
548
549 def toarray(self):
--> 550 return self.tocoo(copy=False).toarray()
551
552 ##############################################################
C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self)
236
237 def toarray(self):
--> 238 B = np.zeros(self.shape, dtype=self.dtype)
239 M,N = self.shape
240 coo_todense(M, N, self.nnz, self.row, self.col, self.data, B.ravel())
ValueError: array is too big.
Can anyone help me fix this? Thank you.
sklearn random forests do not work on sparse input, and your dataset shape is too large and too sparse for a dense version to fit in memory.
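To see why the toarray() call blows up, you can compute how much RAM a dense copy would need before attempting the conversion. A quick sanity check (not part of the original answer; the shape and density below are made-up placeholders standing in for a typical OneHotEncoder output):

import numpy as np
from scipy import sparse

# Hypothetical stand-in for the encoded training matrix.
X_train = sparse.rand(100000, 50000, density=1e-4, format='csr')

n_rows, n_cols = X_train.shape
dense_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize
print("dense copy would need ~%.1f GB" % (dense_bytes / 1e9))  # ~40 GB here

Even at a density of 0.01%, the sparse matrix itself is tiny, but densifying it allocates every zero explicitly, which is what triggers the "array is too big" error.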
You probably have some categorical features with a much too large cardinality (for instance a free-text field or unique entry IDs). Try to drop those features and start over.
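A minimal sketch of that suggestion, assuming the raw (pre-encoding) data is in a pandas DataFrame; the column names and the max_cardinality cutoff are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    'color':   ['red', 'blue', 'red', 'green'],  # low cardinality: fine to encode
    'user_id': ['u1', 'u2', 'u3', 'u4'],         # unique per row: useless to one-hot
    'comment': ['ok', 'great', 'meh', 'bad'],    # free text: huge cardinality on real data
})

max_cardinality = 3  # tiny cutoff for this toy example; tune it for your data
cardinality = df.apply(lambda col: len(col.unique()))
keep = [c for c in df.columns if cardinality[c] <= max_cardinality]
print("dropping: %s" % [c for c in df.columns if c not in keep])
df = df[keep]
# One-hot encode only the surviving columns; the result should now be small
# enough to densify with toarray() before fitting the forest.

Dropping the high-cardinality columns shrinks the number of one-hot output columns from roughly the total number of distinct values to something manageable, which is what makes the dense conversion feasible.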