I have a CSV file of size [66k, 56k] (rows, columns). It's a sparse matrix. I know that numpy can handle a matrix of that size. I would like to know, based on everyone's experience, how many features scikit-learn algorithms can handle comfortably?
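For reference, a minimal sketch of how data of this shape could be loaded into a scipy.sparse matrix that scikit-learn accepts (the file name, chunk size, and lack of a header row are assumptions, not details from the question):

```python
import pandas as pd
from scipy import sparse

# Read the (mostly zero) CSV in row chunks and convert each dense chunk to CSR
# right away, so only the nonzero entries stay in memory.
blocks = []
for chunk in pd.read_csv("data.csv", header=None, chunksize=1000):
    blocks.append(sparse.csr_matrix(chunk.values))

X = sparse.vstack(blocks, format="csr")
print(X.shape, X.nnz)  # (rows, columns) and number of stored nonzeros
```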
Some linear models (regression, SGD, Bayes) will probably be your best bet if you need to train your model frequently.
Although before you go running any models, you could try the following:
1) Feature reduction. Are there features in your data that could easily be removed? For example, if your data is text- or ratings-based, there are lots of known options available.
2) Learning curve analysis. Maybe you only need a small subset of your data to train a model, and beyond that point you are only overfitting or gaining tiny increases in accuracy.
Both approaches could allow you to greatly reduce the training data required; a rough sketch of both is below.
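A hypothetical illustration of both suggestions using standard scikit-learn tools (the synthetic data, the k value, and the cross-validation settings are arbitrary placeholders, not part of the answer):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

# Toy stand-in for the real data: a random non-negative sparse matrix and labels.
rng = np.random.default_rng(0)
X = sparse.random(2000, 10000, density=0.01, format="csr", random_state=0)
y = rng.integers(0, 2, size=2000)

# 1) Feature reduction: keep the 1,000 columns most associated with the labels
#    (chi2 needs non-negative features such as counts or tf-idf values).
X_reduced = SelectKBest(chi2, k=1000).fit_transform(X, y)

# 2) Learning curve analysis: fit on growing fractions of the rows and watch
#    where the validation score levels off.
sizes, train_scores, val_scores = learning_curve(
    SGDClassifier(), X_reduced, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
print(sizes)
print(val_scores.mean(axis=1))
```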
Depends on the estimator. At that size, linear models still perform well, while SVMs will probably take forever to train (and forget about random forests since they won't handle sparse matrices).
I've personally used LinearSVC, LogisticRegression and SGDClassifier with sparse matrices of size roughly 300k × 3.3 million without any trouble. See @amueller's scikit-learn cheat sheet for picking the right estimator for the job at hand.
Full disclosure: I'm a scikit-learn core developer.
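A short sketch of that kind of usage, with a synthetic sparse matrix standing in for the real data (the shapes and density here are made up; the estimators are the ones named above):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

# Synthetic sparse stand-in; in practice X would come from the CSV instead.
rng = np.random.default_rng(0)
X = sparse.random(5000, 56000, density=0.001, format="csr", random_state=0)
y = rng.integers(0, 2, size=5000)

# All three estimators accept CSR input directly, so a 66k x 56k sparse matrix
# never has to be turned into a dense array.
for clf in (LinearSVC(), LogisticRegression(max_iter=1000), SGDClassifier()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))
```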