How to store this collection of documents?

Published 2019-07-23 03:00

Question:

The dataset is like this:

39861    // number of documents
28102    // number of words of the vocabulary (another file)
3710420  // number of nonzero counts in the bag-of-words
1 118 1  // document_id index_in_vocabulary count
1 285 3
...
2 46 1
...
39861 27196 5

We are advised not to store that in a matrix (of size 39861 × 28102, documents × vocabulary), since it won't fit in memory*, and from here I can assume that every integer needs 24 bytes to be stored, so a dense matrix would take about 27 GB (= 39861 × 28102 × 24 bytes). So, which data structure should I use to store the dataset?


An array of lists?

  • If so (every list would have nodes with two data members, the index_in_vocabulary and the count), just post a positive answer; see the sketch after this list. If I assume that every document has on average 200 words, then the space would be:

no_of_documents × words_per_doc × no_of_datamembers × 24 = 39861 × 200 × 2 × 24 bytes ≈ 0.4 GB

  • If not, which one would you propose (which would require less space)?
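For concreteness, here is a minimal sketch of that array-of-lists layout, assuming Python (the question doesn't name a language) and the file format shown above; the file name docword.txt is hypothetical, and I assume the three header lines contain only the numbers (the // comments above read as annotations, not file contents).

def load_bag_of_words(path):
    with open(path) as f:
        n_docs = int(f.readline())     # 39861
        n_words = int(f.readline())    # 28102, vocabulary size
        n_nonzero = int(f.readline())  # 3710420

        # One list per document; each entry is (index_in_vocabulary, count).
        docs = [[] for _ in range(n_docs)]
        for line in f:
            doc_id, word_id, count = map(int, line.split())
            docs[doc_id - 1].append((word_id, count))  # ids are 1-based
    return docs, n_words

docs, n_words = load_bag_of_words("docword.txt")

With ~200 (index, count) pairs per document, this stays well within the 0.4 GB estimate above.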

After storing the dataset, we are required to find the k nearest neighbors (the k most similar documents), both with brute force and with LSH.
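For the brute-force part, a common approach is to put the lists into a SciPy sparse matrix and compare rows. This is only a sketch under the assumption that cosine similarity is the metric (the question does not specify one); it reuses the docs list from the loader above.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def to_sparse(docs, n_words):
    # Build a documents-by-vocabulary CSR matrix from the lists.
    rows, cols, vals = [], [], []
    for doc_idx, entries in enumerate(docs):
        for word_id, count in entries:
            rows.append(doc_idx)
            cols.append(word_id - 1)  # back to 0-based columns
            vals.append(count)
    return csr_matrix((vals, (rows, cols)), shape=(len(docs), n_words))

def knn_brute_force(X, query_row, k):
    # Compare one document against all others; O(n) per query.
    sims = cosine_similarity(X[query_row], X).ravel()
    sims[query_row] = -np.inf             # exclude the query itself
    return np.argsort(sims)[::-1][:k]     # k most similar document indices

X = to_sparse(docs, n_words)
print(knn_brute_force(X, query_row=0, k=10))

Note that the CSR matrix itself needs memory roughly proportional to the 3710420 nonzero counts, not to the full 39861 × 28102 grid.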


*I have 3.8 GiB of RAM on my personal laptop, but I have access to a desktop with ~8 GB of RAM.

Answer 1:

Consider using the HDF5 format.

It should significantly reduce the size of your file.

See my answer to a similar question.
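For illustration, here is a minimal sketch of writing the three coordinate columns to HDF5 with h5py, assuming the docs list from the loader sketch in the question; the file name bow.h5 and the dataset names are hypothetical.

import numpy as np
import h5py

def save_hdf5(docs, path):
    # Flatten the lists back into three parallel columns.
    doc_ids, word_ids, counts = [], [], []
    for doc_idx, entries in enumerate(docs, start=1):
        for word_id, count in entries:
            doc_ids.append(doc_idx)
            word_ids.append(word_id)
            counts.append(count)

    with h5py.File(path, "w") as f:
        # int32 is enough for these ranges; gzip compresses the
        # repetitive integer columns considerably.
        for name, data in [("doc_id", doc_ids),
                           ("word_id", word_ids),
                           ("count", counts)]:
            f.create_dataset(name, data=np.asarray(data, dtype=np.int32),
                             compression="gzip")

save_hdf5(docs, "bow.h5")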