I've seen a number of questions about building histograms in clean one-liners, but I haven't found any that try to make them as fast as possible. I'm creating a lot of tf-idf vectors for a search algorithm, which means building many histograms. My current code is short and readable, but it is not as fast as I would like, and every alternative I have tried turned out far slower. Can you do it faster?
cleanStringVector is a list of strings (all lowercase, no punctuation), and masterWordList is a list of words that should contain every word in cleanStringVector.
from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    frequencyHistogram = Counter(cleanStringVector)
    featureVector = [frequencyHistogram[word] for word in masterWordList]
    return featureVector
It's worth noting that a Counter returns zero for a missing key instead of raising a KeyError. That is a serious plus, and most of the histogram methods in other questions fail this test.
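To make the missing-key point concrete, here is a minimal sketch comparing a Counter with a hand-built plain dict. The names words, counts, and plain are only for illustration and are not part of my real code.

```python
from collections import Counter

words = ["apple", "orange", "apple"]

# Histogram built with Counter.
counts = Counter(words)

# The same histogram built by hand in a plain dict.
plain = {}
for w in words:
    plain[w] = plain.get(w, 0) + 1

# Counter returns 0 for a word it never saw, and the lookup does not add the key.
print(counts["cucumber"])

# A plain dict raises KeyError on the same lookup.
try:
    plain["cucumber"]
except KeyError:
    print("KeyError from plain dict")

# With a plain dict, .get() with a default is needed to get the same behaviour.
print(plain.get("cucumber", 0))
```

So any dict-based replacement has to handle words from masterWordList that do not occur in the input, for example with .get(word, 0).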
Example: If I have the following data:
["apple", "orange", "tomato", "apple", "apple"]
["tomato", "tomato", "orange"]
["apple", "apple", "apple", "cucumber"]
["tomato", "orange", "apple", "apple", "tomato", "orange"]
["orange", "cucumber", "orange", "cucumber", "tomato"]
And a master wordlist of:
["apple", "orange", "tomato", "cucumber"]
I would like the following result for each test case, respectively:
[3, 1, 1, 0]
[0, 1, 2, 0]
[3, 0, 0, 1]
[2, 2, 2, 0]
[0, 2, 1, 2]
I hope that helps.
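For anyone who wants to check a faster version against the example, this short script runs my current function over the five test cases and the master wordlist above. The names testCases, expected, and results are just for this sketch.

```python
from collections import Counter

def tfidfVector(cleanStringVector, masterWordList):
    # Count every word once, then read the counts in master-list order.
    frequencyHistogram = Counter(cleanStringVector)
    return [frequencyHistogram[word] for word in masterWordList]

masterWordList = ["apple", "orange", "tomato", "cucumber"]

testCases = [
    ["apple", "orange", "tomato", "apple", "apple"],
    ["tomato", "tomato", "orange"],
    ["apple", "apple", "apple", "cucumber"],
    ["tomato", "orange", "apple", "apple", "tomato", "orange"],
    ["orange", "cucumber", "orange", "cucumber", "tomato"],
]

expected = [
    [3, 1, 1, 0],
    [0, 1, 2, 0],
    [3, 0, 0, 1],
    [2, 2, 2, 0],
    [0, 2, 1, 2],
]

results = [tfidfVector(case, masterWordList) for case in testCases]
for case, result in zip(testCases, results):
    print(case, "->", result)

# Any replacement function should produce exactly these vectors.
assert results == expected
```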
Approximate final results:
Original Method: 3.213
OrderedDict: 5.529
UnorderedDict: 0.190
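The code behind these figures is not shown here, so the following is only a sketch of how such a comparison could be timed with timeit. indexVersion is a hypothetical plain-dict variant that I am assuming for illustration; it is not necessarily the method measured as "UnorderedDict" above, and the numbers it prints will differ by machine and input size.

```python
import timeit
from collections import Counter

def counterVersion(cleanStringVector, masterWordList):
    # The original approach: Counter, then a lookup per master word.
    frequencyHistogram = Counter(cleanStringVector)
    return [frequencyHistogram[word] for word in masterWordList]

def indexVersion(cleanStringVector, wordIndex):
    # Hypothetical variant: wordIndex maps each master word to its position,
    # so counts are written straight into the output list.
    # Raises KeyError if a word is missing from the master list.
    featureVector = [0] * len(wordIndex)
    for word in cleanStringVector:
        featureVector[wordIndex[word]] += 1
    return featureVector

masterWordList = ["apple", "orange", "tomato", "cucumber"]
# Built once and reused for every document.
wordIndex = {word: i for i, word in enumerate(masterWordList)}

document = ["tomato", "orange", "apple", "apple", "tomato", "orange"]

# Both versions must agree before the timings mean anything.
assert counterVersion(document, masterWordList) == indexVersion(document, wordIndex)

counterTime = timeit.timeit(lambda: counterVersion(document, masterWordList), number=10000)
indexTime = timeit.timeit(lambda: indexVersion(document, wordIndex), number=10000)

print("Counter version: %.3f s" % counterTime)
print("Index version:   %.3f s" % indexTime)
```

The sketch relies on masterWordList really containing every input word, as stated at the top. If that cannot be guaranteed, the lookup in indexVersion would need a guard.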