I am using the gensim library to apply LDA to a set of documents. Gensim lets me apply LDA to a corpus regardless of the term weighting: binary, tf, tf-idf, and so on.
My question is: which term weighting should be used for the original LDA? If I have understood correctly, the weights should be term frequencies, but I am not sure.
Yes: the corpus should be represented as a "bag of words", that is, as lists of raw term counts.
The correct format is that of the corpus defined in the first tutorial on the Gensim webpage (the tutorials are really useful).
Namely, if you have a dictionary as defined in Radim's tutorial, and the following documents,
doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]
then your corpus (for use with LDA) should be an iterable object (such as a list) of lists of tuples of the form (dictKey, count), where dictKey is the dictionary key of a term and count is the number of times that term occurs in the document. This is done for you with
corpus = [dictionary.doc2bow(doc) for doc in docs]
The name doc2bow stands for "document to bag of words".
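
To make the (dictKey, count) format concrete, here is a minimal pure-Python sketch of what a bag-of-words conversion produces. It is not gensim's implementation: build_token2id and to_bow are hypothetical helpers written for illustration, doc3 is an extra document added only to show a count above 1, and the integer ids are arbitrary (gensim may assign different ones).

```python
from collections import Counter

def build_token2id(docs):
    """Hypothetical stand-in for a dictionary: map each term to an integer id."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return a sorted list of (term id, raw count) tuples for one document."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
doc3 = ['big', 'data', 'big', 'cash']  # assumption: added to show a repeated term
docs = [doc1, doc2, doc3]

token2id = build_token2id(docs)
corpus = [to_bow(doc, token2id) for doc in docs]

for bow in corpus:
    print(bow)
```

Each inner list holds one tuple per distinct term, and the second element is always an integer count rather than a binary flag or a tf-idf weight. In doc3, 'big' occurs twice, so its tuple carries a count of 2.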