In Weka, class StringToWordVector defines a method called setNormalizeDocLength. It normalizes word frequencies of a document. My questions are:
- what is meant by "normalizing word frequency of a document"?
- How does Weka do this?
A practical example would help me most. Thanks in advance.
Looking in the Weka source, this is the method that does the normalising:
It looks like the most relevant part is
So it looks like the normalisation is:
value = currentValue * averageDocumentLength / actualDocumentLength
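Since a practical example was asked for, here is a minimal standalone sketch of that formula in plain Java. It does not call Weka. The class and method names are hypothetical, and it assumes a document's "length" is the Euclidean norm of its word-count vector and that the average is taken over the documents shown, so treat it as an illustration of the arithmetic rather than as Weka's actual code.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Weka.
public class DocLengthNormalizationSketch {

    // Assumption: document length = Euclidean norm of the word-count vector.
    static double length(double[] counts) {
        double sumOfSquares = 0.0;
        for (double c : counts) {
            sumOfSquares += c * c;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Applies: value = currentValue * averageDocumentLength / actualDocumentLength
    static double[] normalize(double[] counts, double averageDocumentLength) {
        double actualDocumentLength = length(counts);
        double[] result = new double[counts.length];
        if (actualDocumentLength == 0.0) {
            return result; // empty document: nothing to scale, keep zeros
        }
        for (int i = 0; i < counts.length; i++) {
            result[i] = counts[i] * averageDocumentLength / actualDocumentLength;
        }
        return result;
    }

    public static void main(String[] args) {
        // Two documents over a two-word vocabulary, same word proportions,
        // but the second document is three times as long.
        double[] shortDoc = {3, 4};   // length 5
        double[] longDoc = {9, 12};   // length 15

        double averageDocumentLength = (length(shortDoc) + length(longDoc)) / 2.0; // 10.0

        System.out.println(Arrays.toString(normalize(shortDoc, averageDocumentLength))); // [6.0, 8.0]
        System.out.println(Arrays.toString(normalize(longDoc, averageDocumentLength)));  // [6.0, 8.0]
    }
}
```

Both documents end up with the same vector, which is the point of the normalisation: word frequencies stop depending on how long the document is, so a long document does not dominate just because every word occurs more often in it.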