I am implementing the tf-idf algorithm in a web application using Python, and it runs extremely slowly. Here is what I basically do:
1) Create 2 dictionaries:
- First dictionary: key (document id), value (list of all found words (incl. repeated) in doc)
- Second dictionary: key (document id), value (set of the unique words in the doc)
Now, when a user requests the tf-idf results for a document d, I do the following:
2) Loop over the unique words of the second dictionary for the document d, and for each unique word w get:
2.1) tf score (how many times w appears in d): loop over the list of words in the first dictionary for the document
2.2) df score (how many docs contain w): loop over the sets of words of all documents (second dictionary) and check whether w is contained. I use a set since checking membership in a set seems to be faster than in a list.
Step 2.2 is terribly slow. For example, having 1000 documents, and for a document with 2313 unique words, it takes around 5 minutes to output the results.
Is there any other way to make step 2.2 faster? Are dictionaries really that slow to iterate over?
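A stripped-down sketch of the steps above (the dictionary names and sample data are illustrative, not from my real app):

```python
# Sketch of the current (slow) approach.
doc_words = {1: ["a", "b", "a"], 2: ["b", "c"]}   # first dict: doc id -> all words, incl. repeats
doc_vocab = {1: {"a", "b"}, 2: {"b", "c"}}        # second dict: doc id -> unique words

def tf_df_slow(d):
    scores = {}
    for w in doc_vocab[d]:                        # step 2: each unique word of d
        tf = doc_words[d].count(w)                # step 2.1: scans the whole word list
        df = sum(1 for vocab in doc_vocab.values() if w in vocab)  # step 2.2: scans every doc
        scores[w] = (tf, df)
    return scores

print(tf_df_slow(1))
```

Step 2.2 is the killer: it touches every document's vocabulary once per unique word, so the work is roughly (unique words in d) × (number of documents).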
Well, you have to rethink and redesign the way you hold your data; in other words, implement an "orthodox" version of an inverted index.
Your bottleneck is the on-the-fly calculation of document frequencies (DF) for the terms. Make this incremental instead: every time you update your corpus (collection of documents), do some processing then and update the DF of every term in the new document (and, of course, save the results in a persistent way, e.g. a database).
The only structure you need is a nested dictionary like this:
{ "term1": { "DF": x, "some_doc_id": tf, "some_other_doc_id": tf, ... },
  "term2": { ... },
  ...
}
properly updated every time you "feed" your corpus.
And of course, keep the cardinality of your corpus (the total number of documents) somewhere...
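A minimal sketch of that idea, assuming the standard tf * log(N/DF) weighting (adapt the formula and persistence to your own needs; in practice you would also keep a per-document term list rather than scanning the whole index as this toy version does):

```python
import math
from collections import Counter

index = {}     # {term: {"DF": doc_count, doc_id: tf, ...}}
num_docs = 0   # corpus cardinality

def feed(doc_id, words):
    """Update the index once, when a document enters the corpus."""
    global num_docs
    num_docs += 1
    for term, tf in Counter(words).items():
        entry = index.setdefault(term, {"DF": 0})
        entry["DF"] += 1      # one more document contains this term
        entry[doc_id] = tf    # term frequency within this document

def tfidf(doc_id):
    """DF is now a single dict lookup per term -- no scan over the corpus."""
    return {term: entry[doc_id] * math.log(num_docs / entry["DF"])
            for term, entry in index.items() if doc_id in entry}

feed("d1", ["apple", "banana", "apple"])
feed("d2", ["banana", "cherry"])
print(tfidf("d1"))
```

The expensive counting happens once per document at ingestion time, so serving a user request no longer depends on the corpus size.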
As a hobby and as part of my work, I am implementing a small Python/Redis-backed search engine; you might get some other ideas from it as well. Take a look here.
Is this an academic endeavor, or are you doing it for production? If you're implementing it for production, why not use something already available (e.g. http://code.google.com/p/tfidf/)? If, on the other hand, you're doing it as an academic exercise, I'd still take a look at an existing implementation to see what it does differently (if anything at all).
I'd also suggest using cProfile to profile your code and see where the expense actually is.
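For example (the `compute_tfidf` function here is a placeholder; substitute your own entry point):

```python
import cProfile
import pstats

def compute_tfidf():
    # placeholder for your real tf-idf routine
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
compute_tfidf()
profiler.disable()

# Show the ten most expensive calls, sorted by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```

You can also run a whole script under the profiler with `python -m cProfile -s cumulative your_script.py`.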