Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms.
1) What function does it use to do the hashing?
2) How can I achieve the same hashed value from Python?
3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?
If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:
def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures
As you can see it is just a plain old hash modulo the number of buckets.
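For example, you can reproduce the bucket without Spark's help (a minimal sketch, assuming the Python API with its default numFeatures of 1 << 20 and deterministic Python 2 string hashing):

from pyspark.mllib.feature import HashingTF

htf = HashingTF()                          # default numFeatures is 1 << 20
term = "spark"

# indexOf is just Python's built-in hash modulo the number of buckets
assert htf.indexOf(term) == hash(term) % htf.numFeatures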
The final hash is just a vector of counts per bucket (I've omitted the docstring and RDD case for brevity):
def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())
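For instance (a toy example with a tiny feature space for readability; the exact bucket positions depend on hash()):

from pyspark.mllib.feature import HashingTF

htf = HashingTF(numFeatures=16)            # tiny feature space, easy to read
doc = ["spark", "hashing", "spark"]

vec = htf.transform(doc)                   # SparseVector of length 16
# The bucket for "spark" holds 2.0, the bucket for "hashing" holds 1.0
print(vec)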
If you want to ignore frequencies then you can use set(document) as an input, but I doubt there is much to gain here. To create the set you'll have to compute a hash for each element anyway.
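If you do want the binary variant anyway, it is a one-line change (a sketch using the same API as above):

from pyspark.mllib.feature import HashingTF

htf = HashingTF()
doc = ["spark", "spark", "hashing"]

tf_counts = htf.transform(doc)             # frequencies: 2.0 for "spark"
tf_binary = htf.transform(set(doc))        # presence only: 1.0 per distinct term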
It seems to me that there is something else going on under the hood beyond what the source zero323 linked shows. I found that hashing and then taking the modulus as the source code does wouldn't give me the same indices that HashingTF generates. At least for single characters, what I had to do was convert the char to its ASCII code, like so (Python 2.7):
index = ord('a') # 97
This corresponds to what HashingTF outputs for the index. If I did the same thing that HashingTF appears to do, which is:
index = hash('a') % (1 << 20) # 897504
I would very clearly get the wrong index.
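A plausible explanation (my guess, not something confirmed by the Python source above) is that the transformation actually ran on the JVM side, where terms are hashed with Java's String.hashCode; for a one-character string that value is just the character's code point, which is exactly what ord() returns. A small Python 2.7 sketch contrasting the two schemes, assuming the default feature space of 1 << 20:

# JVM-style index for a single character vs. the plain Python hash index
numFeatures = 1 << 20                      # HashingTF's default number of buckets

jvm_style = ord('a') % numFeatures         # 97 -- matches the indices I observed
python_style = hash('a') % numFeatures     # 897504 on 64-bit CPython 2.7

print jvm_style, python_style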