How to store this collection of documents?

Published 2019-07-23 03:00

Question:

The dataset is like this:

39861    // number of documents
28102    // number of words of the vocabulary (another file)
3710420  // number of nonzero counts in the bag-of-words
1 118 1  // document_id index_in_vocabulary count
1 285 3
...
2 46 1
...
39861 27196 5

We are advised not to store that in a matrix (of size 39861 × 28102, documents × vocabulary), since it won't fit in memory*, and from here I can assume that every integer needs 24 bytes to be stored, so a dense matrix would take about 27 GB (= 39861 × 28102 × 24 bytes). So, which data structure should I use to store the dataset?


An array of lists?

  • If so (every list would have nodes with two data members, the index_in_vocabulary and the count), just post a positive answer; see the sketch after this list. If I assume that every document has on average 200 words, then the space would be:

no_of_documents × words_per_doc × no_of_datamembers × 24 = 39861 × 200 × 2 × 24 bytes ≈ 0.4 GB

  • If not, which one would you propose (which would require less space)?
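For concreteness, here is a minimal sketch of that array-of-lists layout, assuming Python (the question doesn't name a language) and the file format shown above; the file name docword.txt is hypothetical, and I assume the three header lines contain only the numbers (the // comments above read as annotations, not file contents).

def load_bag_of_words(path):
    with open(path) as f:
        n_docs = int(f.readline())     # 39861
        n_words = int(f.readline())    # 28102, vocabulary size
        n_nonzero = int(f.readline())  # 3710420

        # One list per document; each entry is (index_in_vocabulary, count).
        docs = [[] for _ in range(n_docs)]
        for line in f:
            doc_id, word_id, count = map(int, line.split())
            docs[doc_id - 1].append((word_id, count))  # ids are 1-based
    return docs, n_words

docs, n_words = load_bag_of_words("docword.txt")

With ~200 (index, count) pairs per document, this stays well within the 0.4 GB estimate above.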

After storing the dataset, we are required to find the k nearest neighbors (the k most similar documents), both with brute force and with LSH.
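For the brute-force part, a common approach is to put the lists into a SciPy sparse matrix and compare rows. This is only a sketch under the assumption that cosine similarity is the metric (the question does not specify one); it reuses the docs list from the loader above.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def to_sparse(docs, n_words):
    # Build a documents-by-vocabulary CSR matrix from the lists.
    rows, cols, vals = [], [], []
    for doc_idx, entries in enumerate(docs):
        for word_id, count in entries:
            rows.append(doc_idx)
            cols.append(word_id - 1)  # back to 0-based columns
            vals.append(count)
    return csr_matrix((vals, (rows, cols)), shape=(len(docs), n_words))

def knn_brute_force(X, query_row, k):
    # Compare one document against all others; O(n) per query.
    sims = cosine_similarity(X[query_row], X).ravel()
    sims[query_row] = -np.inf             # exclude the query itself
    return np.argsort(sims)[::-1][:k]     # k most similar document indices

X = to_sparse(docs, n_words)
print(knn_brute_force(X, query_row=0, k=10))

Note that the CSR matrix itself needs memory roughly proportional to the 3710420 nonzero counts, not to the full 39861 × 28102 grid.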


*I have 3.8 GiB of RAM on my personal laptop, but I have access to a desktop with ~8 GB of RAM.

Answer 1:

Consider using the HDF5 format.

It should significantly reduce the size of your file.

See my answer to a similar question.
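For illustration, here is a minimal sketch of writing the three coordinate columns to HDF5 with h5py, assuming the docs list from the loader sketch in the question; the file name bow.h5 and the dataset names are hypothetical.

import numpy as np
import h5py

def save_hdf5(docs, path):
    # Flatten the lists back into three parallel columns.
    doc_ids, word_ids, counts = [], [], []
    for doc_idx, entries in enumerate(docs, start=1):
        for word_id, count in entries:
            doc_ids.append(doc_idx)
            word_ids.append(word_id)
            counts.append(count)

    with h5py.File(path, "w") as f:
        # int32 is enough for these ranges; gzip compresses the
        # repetitive integer columns considerably.
        for name, data in [("doc_id", doc_ids),
                           ("word_id", word_ids),
                           ("count", counts)]:
            f.create_dataset(name, data=np.asarray(data, dtype=np.int32),
                             compression="gzip")

save_hdf5(docs, "bow.h5")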