I have a CSV file of size [66k, 56k] (rows, columns). It's a sparse matrix. I know that numpy can handle a matrix of that size. Based on everyone's experience, I would like to know how many features the scikit-learn algorithms can handle comfortably?
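For reference, a minimal sketch (not part of the original question) of one way to load a large, mostly-zero CSV into a scipy.sparse CSR matrix without materialising the full dense array at once; the file name, chunk size, and the assumption that all columns are numeric are placeholders:

```python
import pandas as pd
from scipy import sparse

# Read the CSV in chunks and convert each chunk to CSR, so only one small
# dense block exists in memory at a time.
chunks = []
for chunk in pd.read_csv("data.csv", chunksize=5_000):  # hypothetical file name and chunk size
    chunks.append(sparse.csr_matrix(chunk.to_numpy()))
X = sparse.vstack(chunks, format="csr")
print(X.shape, X.nnz)
```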
Answer 1:
Depends on the estimator. At that size, linear models still perform well, while SVMs will probably take forever to train (and forget about random forests since they won't handle sparse matrices).
I've personally used LinearSVC, LogisticRegression and SGDClassifier with sparse matrices of size roughly 300k × 3.3 million without any trouble. See @amueller's scikit-learn cheat sheet for picking the right estimator for the job at hand.
Full disclosure: I'm a scikit-learn core developer.
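As a minimal sketch of what that looks like in practice (the shapes, density, and estimator settings below are placeholders, not from the original answer): all three estimators accept scipy.sparse input directly, so the matrix is never densified during training.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

# Stand-in sparse data: 10k samples, 50k features, ~0.1% non-zero entries.
X = sparse.random(10_000, 50_000, density=0.001, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=10_000)

# Each estimator trains on the CSR matrix as-is.
for clf in (LinearSVC(), LogisticRegression(max_iter=200), SGDClassifier()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))
```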
Answer 2:
Some linear models (regression, SGD, Bayes) will probably be your best bet if you need to train your model frequently.
Before you go running any models, though, you could try the following:
1) Feature reduction. Are there features in your data that could easily be removed? For example, if your data is text- or ratings-based, there are lots of known options available (sketched after this list).
2) Learning curve analysis. Maybe you only need a small subset of your data to train a model; beyond that point you are only overfitting or gaining tiny increases in accuracy (also sketched below).
Both approaches could allow you to greatly reduce the training data required.
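A rough sketch of both ideas on synthetic stand-in data (the feature counts, k, and cross-validation settings are assumptions, not from the original answer): chi-squared feature selection, one common option for non-negative text or count features, followed by a learning-curve check.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a large sparse feature matrix with binary labels.
X = sparse.random(5_000, 20_000, density=0.001, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=5_000)

# 1) Feature reduction: keep the k features most associated with the labels.
#    chi2 requires non-negative features (true for counts or TF-IDF values).
X_reduced = SelectKBest(chi2, k=2_000).fit_transform(X, y)
print("reduced shape:", X_reduced.shape)

# 2) Learning-curve analysis: if validation accuracy flattens well before the
#    full training set is used, a subset of the data may be enough.
sizes, train_scores, val_scores = learning_curve(
    SGDClassifier(), X_reduced, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
print("train sizes:", sizes)
print("mean validation score:", val_scores.mean(axis=1))
```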