I gather Text documents (in Node.js) where one document i
is represented as a list of words.
What is an efficient way to compute the similarity between these documents, taking into account that new documents are coming as a sort of stream of documents?
I currently use cos-similarity on the Normalized Frequency of the words within each document. I don't use the TF-IDF (Term frequency, Inverse document frequency) because of the scalability issue since I get more and more documents.
Initially
My first version was to start with the currently available documents, compute a big Term-Document matrix A
, and then compute S = A^T x A
so that S(i, j)
is (after normalization by both norm(doc(i))
and norm(doc(j))
) the cos-similarity between documents i
and j
whose word frequencies are respectively doc(i)
and doc(j)
.
For new documents
What do I do when I get a new document doc(k)
? Well, I have to compute the similarity of this document with all the previous ones, which doesn't require to build a whole matrix. I can just take the inner-product of doc(k) dot doc(j)
for all previous j
, and that result in S(k, j)
, which is great.
The troubles
Computing
S
in Node.js is really long. Way too long in fact! So I decided to create a C++ module which would do the whole thing much faster. And it does! But I cannot wait for it, I should be able to use intermediate results. And what I mean by "not wait for it" is botha. wait for the computation to be done, but also
b. wait for the matrixA
to be built (it's a big one).Computing new
S(k, j)
can take advantage of the fact that documents have way less words than the set of all the given words (which I use to build the whole matrixA
). Thus, it looks faster to do it in Node.js, avoiding a lot of extra-resource to be taken to access the data.
But is there any better way to do that?
Note : the reason I started computing S
is that I can easily build A
in Node.js where I have access to all the data, and then do the matrix multiplication in C++ and get it back in Node.js, which speeds the whole thing a lot. But now that computing S
gets impracticable, it looks useless.
Note 2 : yep, I don't have to compute the whole S
, I can just compute the upper-right elements (or the lower-left ones), but that's not the issue. The time computation issue is not of that order.