I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each document in the corpus is represented as a list of (keyword, weight) pairs extracted from the news article.
I found two Java toolkits: MALLET and LingPipe. I've read the MALLET tutorial on importing data, and it takes plain text as input, not the format that I have. Is there any way to make it accept my format?
I also read a little about LingPipe; the example in its tutorial used arrays of integers. Is that practical for large data?
Which implementation of LDA would be better for me? Are there any other implementations (in Java) that suit my data?
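From the keyword-weight file you can create an artificial text for each document, in which every keyword is repeated in proportion to its weight and the words appear in random order. LDA treats a document as a bag of words, so the order itself does not matter; only the counts do. Run MALLET on the generated texts to retrieve the topics.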
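As a rough illustration of that idea, here is a minimal sketch in plain Java (no MALLET dependency). The class name `PseudoDocuments`, the method `toPseudoText`, the `scale` factor, and the sample keywords are all made up for this example. It assumes the weights are non-negative numbers and that multiplying by `scale` and rounding gives a reasonable repeat count; you would need to tune `scale` to the range of your own weights. Persian keywords work the same way as the ASCII ones used here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class PseudoDocuments {

    /**
     * Builds an artificial text from (keyword, weight) pairs.
     * Each keyword is repeated round(weight * scale) times, at least once,
     * and the resulting tokens are shuffled.
     */
    static String toPseudoText(Map<String, Double> weights, double scale, Random rng) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            // Assumption: weights are non-negative; keep every keyword at least once.
            int copies = (int) Math.max(1, Math.round(entry.getValue() * scale));
            for (int i = 0; i < copies; i++) {
                tokens.add(entry.getKey());
            }
        }
        // Order is irrelevant to LDA; shuffling just avoids long runs of one word.
        Collections.shuffle(tokens, rng);
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Hypothetical document: keywords and weights are invented for the example.
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("economy", 0.5);
        doc.put("bank", 0.3);
        doc.put("inflation", 0.2);

        String text = toPseudoText(doc, 10, new Random(42));

        // One document per line: an id, a placeholder label, then the text.
        // "doc1" and "news" are placeholders, not values from the original data.
        System.out.println("doc1 news " + text);
    }
}
```

Writing one such line per document to a single file should give something MALLET's one-instance-per-line importer can read, but check the import tutorial for the exact options your version expects.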