scikit-learn vectorizing with big dataset

Published 2019-09-29 03:38


戒情不戒烟


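I have 9GB of segmented files on my disk, but my VPS has only 4GB of memory.

How can I vectorize all the data without loading the whole corpus into memory at initialization? Is there any sample code?

My code is as follows: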

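from sklearn.feature_extraction.text import CountVectorizer

# Reads every file fully into memory before fitting (this is what runs out of RAM)
contents = [open('./seg_corpus/' + filename).read()
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(contents)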

Answer 1:

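Try this: instead of loading all the text into memory, you can pass only the file handles to the fit method, but you must specify input='file' in the CountVectorizer constructor.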

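from sklearn.feature_extraction.text import CountVectorizer

# Pass open file handles; the vectorizer calls .read() on each one in turn
contents = [open('./seg_corpus/' + filename)
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words, input='file')
vectorizer.fit(contents)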
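
A possible variation, not part of the original answer: CountVectorizer also accepts input='filename', in which case it opens, reads and closes each path itself, so you do not need to keep one open handle per file. The sketch below assumes UTF-8 text files. The helper name fit_from_paths and the two tiny demo files are made up for illustration. Note that the learned vocabulary still has to fit in memory, so this only avoids holding the raw corpus at once.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer


def fit_from_paths(paths, stop_words=None):
    """Fit a CountVectorizer on files given by path (hypothetical helper)."""
    # input='filename' tells the vectorizer to open and read each path
    # itself, one document at a time, instead of receiving the text.
    vectorizer = CountVectorizer(stop_words=stop_words, input='filename')
    vectorizer.fit(paths)
    return vectorizer


# Tiny demo corpus in a temporary directory (illustrative data only).
demo_dir = tempfile.mkdtemp()
demo_paths = []
for name, text in [('a.txt', 'big data big corpus'), ('b.txt', 'small corpus')]:
    path = os.path.join(demo_dir, name)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    demo_paths.append(path)

demo_vectorizer = fit_from_paths(demo_paths)
# transform() also takes paths, because input='filename' is still set.
demo_counts = demo_vectorizer.transform(demo_paths)
```

With the question's layout, the call would look like fit_from_paths(['./seg_corpus/' + f for f in filenames], stop_words=stop_words).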

Source: scikit-learn vectorizing with big dataset