The dataset is like this:
39861 // number of documents
28102 // number of words of the vocabulary (another file)
3710420 // number of nonzero counts in the bag-of-words
1 118 1 // document_id index_in_vocabulary count
1 285 3
...
2 46 1
...
39861 27196 5
We are advised not to store that in a matrix (of size 39861 x 28102, I guess), since it won't fit in memory*. From here I assume that every integer needs 24 bytes to be stored, so a dense matrix would take about 27 GB (= 39861 * 28102 * 24 bytes). So, which data structure should I use to store the dataset?
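For reference, here is a quick back-of-the-envelope check of those numbers in Python (assuming, as above, roughly 24 bytes per stored integer):

```python
docs, vocab, nnz = 39861, 28102, 3710420
bytes_per_int = 24  # assumed per-integer cost, as above

dense = docs * vocab * bytes_per_int   # every (document, word) cell stored
sparse = nnz * 2 * bytes_per_int       # only the nonzero (index, count) pairs

print(f"dense : {dense / 1e9:.1f} GB")   # ~26.9 GB
print(f"sparse: {sparse / 1e9:.2f} GB")  # ~0.18 GB
```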
An array of lists?
- If so (every list will have nodes with two data members, the index_in_vocabulary and the count), just post a positive answer; a rough sketch of what I mean is below this list. If I assume that every document has on average 200 words, then the space would be:
no_of_documents * words_per_doc * no_of_datamembers * 24 = 39861 * 200 * 2 * 24 ≈ 0.4 GB
- If not, which one would you propose (which would require less space)?
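In case it helps, here is a minimal sketch of the array-of-lists idea, assuming the file format shown above (the path `docword.txt` is just a placeholder name):

```python
def load_bow(path):
    """Read the bag-of-words file into a list indexed by document id.

    Each entry is a list of (index_in_vocabulary, count) pairs.
    """
    with open(path) as f:
        n_docs = int(f.readline())
        n_words = int(f.readline())  # vocabulary size (not used here)
        nnz = int(f.readline())      # number of nonzero counts (not used here)

        docs = [[] for _ in range(n_docs + 1)]  # +1 so 1-based document ids index directly
        for line in f:
            doc_id, word_id, count = map(int, line.split())
            docs[doc_id].append((word_id, count))
    return docs

# docs = load_bow("docword.txt")
# docs[1] -> [(118, 1), (285, 3), ...]
```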
After storing the dataset, we are required to find the k nearest neighbors (the k most similar documents), both with brute force and with LSH.
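For the brute-force part, this is the kind of thing I have in mind (a sketch only, assuming cosine similarity and the `docs` structure from the sketch above; the LSH part is left out):

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two documents given as lists of (word_id, count) pairs."""
    da = dict(a)
    dot = sum(c * da.get(w, 0) for w, c in b)
    na = math.sqrt(sum(c * c for _, c in a))
    nb = math.sqrt(sum(c * c for _, c in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_brute_force(query_id, docs, k=10):
    """Return (similarity, doc_id) for the k documents most similar to docs[query_id]."""
    sims = ((cosine(docs[query_id], docs[j]), j)
            for j in range(1, len(docs)) if j != query_id and docs[j])
    return heapq.nlargest(k, sims)

# neighbours = knn_brute_force(1, docs, k=5)
```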
*I have 3.8 GiB of RAM in my personal laptop, but I have access to a desktop with ~8 GB of RAM.