I have a corpus of around 8 million news articles, and I need their TF-IDF representation as a sparse matrix. I have been able to do this with scikit-learn for a relatively small number of samples, but I believe it can't be used for such a huge dataset because it loads the input matrix into memory first, and that's an expensive process.
Does anyone know what the best way would be to extract the TF-IDF vectors for large datasets?
gensim has an efficient tf-idf model and does not need to have everything in memory at once.
http://radimrehurek.com/gensim/intro.html
Your corpus simply needs to be an iterable, so the whole thing never has to be in memory at once.
The make_wiki script (https://github.com/piskvorky/gensim/blob/develop/gensim/scripts/make_wikicorpus.py) runs over Wikipedia in about 50 minutes on a laptop, according to the comments.
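As a rough sketch of what a streamed setup could look like (the file name "articles.txt", the one-article-per-line layout, and the whitespace tokenization are all assumptions; adapt them to however your corpus is stored):

```python
from gensim import corpora, models

class NewsCorpus:
    """Yields one bag-of-words vector at a time, never the whole corpus."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())

# First streaming pass: build the vocabulary.
dictionary = corpora.Dictionary(
    line.lower().split()
    for line in open("articles.txt", encoding="utf-8")
)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # optional vocabulary pruning

# Second streaming pass: collect document frequencies for the TF-IDF weights.
corpus = NewsCorpus("articles.txt", dictionary)
tfidf = models.TfidfModel(corpus)

# Applying the model is lazy, so iterating stays streamed as well.
for doc in tfidf[corpus]:
    pass  # each doc is a sparse list of (term_id, tfidf_weight) pairs
```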
I believe you can use a HashingVectorizer to get a smallish csr_matrix out of your text data and then use a TfidfTransformer on that. Storing a sparse matrix of 8M rows and several tens of thousands of columns isn't such a big deal. Another option would be not to use TF-IDF at all: it could be the case that your system works reasonably well without it.
In practice you may have to subsample your data set; sometimes a system will do just as well by learning from only 10% of all available data. This is an empirical question: there is no way to tell in advance which strategy would be best for your task. I wouldn't worry about scaling to 8M documents until I am convinced I need them (i.e. until I have seen a learning curve showing a clear upward trend).
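For illustration only, such a learning curve could be drawn roughly like this; the 20 newsgroups data and the SGD classifier are stand-ins for your own documents and model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["rec.autos", "sci.space"])

pipeline = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    TfidfTransformer(),
    SGDClassifier(),
)

sizes, _, valid_scores = learning_curve(
    pipeline, data.data, data.target,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# If this curve is still climbing at the right-hand edge, more data may help;
# if it has flattened out, scaling up further probably won't.
plt.plot(sizes, valid_scores.mean(axis=1))
plt.xlabel("number of training documents")
plt.ylabel("cross-validated accuracy")
plt.show()
```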
Below is something I was working on this morning as an example. You can see the performance of the system tends to improve as I add more documents, but it is already at a stage where it seems to make little difference. Given how long it takes to train, I don't think training it on 500 files is worth my time.