Persist Tf-Idf data

Posted 2019-02-07 11:03

I want to store the TF-IDF matrix so I don't have to recalculate it all the time. I am using scikit-learn's TfidfVectorizer. Is it more efficient to pickle it or to store it in a database?

Some context: I am using k-means clustering to provide document recommendation. Since new documents are added frequently, I would like to store the TF-IDF values of the documents so that I can recalculate the clusters.

1 Answer
等我变得足够好
Answered 2019-02-07 11:41

Pickling (especially using joblib.dump) is good for short-term storage, e.g. to save partial results in an interactive session or to ship a model from a development server to a production server.
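A minimal sketch of this short-term approach (filenames are arbitrary): dump both the fitted vectorizer and the matrix with joblib, then reload them later in the same scikit-learn version.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog barked", "cats and dogs"]

# Fit once and compute the TF-IDF matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Short-term persistence: both the fitted vectorizer and the matrix.
joblib.dump(vectorizer, "tfidf_vectorizer.joblib")
joblib.dump(tfidf, "tfidf_matrix.joblib")

# Later (same scikit-learn version!): reload and vectorize new documents.
vectorizer2 = joblib.load("tfidf_vectorizer.joblib")
new_vec = vectorizer2.transform(["a new cat document"])
```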

However, the pickle format depends on the class definitions of the models, which may change from one version of scikit-learn to another.

If you plan to keep the model for a long time and to load it in future versions of scikit-learn, I would recommend writing your own implementation-independent persistence layer.
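One way to sketch such an implementation-independent layer (this is my own illustration, not an official scikit-learn recipe): export only plain data, i.e. the learned vocabulary and idf weights, to JSON, and recompute the tf-idf transform from raw counts on reload. This assumes the vectorizer's default settings (`use_idf=True`, `smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`).

```python
import json

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat", "the dog barked", "cats and dogs"]

vectorizer = TfidfVectorizer()  # default settings assumed throughout
vectorizer.fit(docs)

# Export only plain data: the term -> column mapping and the idf weights.
state = {
    "vocabulary": {t: int(i) for t, i in vectorizer.vocabulary_.items()},
    "idf": vectorizer.idf_.tolist(),
}
with open("tfidf_state.json", "w") as f:
    json.dump(state, f)

# --- later, possibly under a different scikit-learn version ---
with open("tfidf_state.json") as f:
    state = json.load(f)

idf = np.asarray(state["idf"])
counter = CountVectorizer(vocabulary=state["vocabulary"])

def tfidf_transform(texts):
    """Recompute tf-idf from raw counts: counts * idf, then l2-normalize."""
    counts = counter.transform(texts)
    return normalize(counts.multiply(idf), norm="l2")

new_tfidf = tfidf_transform(["a new cat and dog document"])
```

Because only a dict and a list of floats are stored, this survives scikit-learn upgrades that would break a pickle.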

I would also recommend the HDF5 file format (used, for instance, by PyTables) or another database system that has some kind of efficient support for storing numerical arrays.

Also have a look at scipy.sparse's internal CSR and COO data structures for sparse matrix representation, to come up with an efficient way to store such matrices in a database.
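To make the CSR idea concrete: a CSR matrix decomposes into three flat numeric arrays plus a shape, each of which any database or binary store can hold (e.g. as BLOB columns). A small sketch with a toy matrix:

```python
import numpy as np
import scipy.sparse as sp

# Toy sparse TF-IDF-like matrix.
m = sp.csr_matrix(np.array([[0.0, 1.2, 0.0], [0.5, 0.0, 0.7]]))

# CSR is just three flat arrays plus the shape; store each however you like
# (here as an .npz file, but each array could be a BLOB column instead).
np.savez(
    "tfidf_csr.npz",
    data=m.data,
    indices=m.indices,
    indptr=m.indptr,
    shape=np.array(m.shape),
)

# Reassemble the matrix from the stored arrays.
loaded = np.load("tfidf_csr.npz")
m2 = sp.csr_matrix(
    (loaded["data"], loaded["indices"], loaded["indptr"]),
    shape=tuple(loaded["shape"]),
)
```

`scipy.sparse.save_npz` / `load_npz` wrap essentially this same decomposition if you only need file storage.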
