I need to cluster some text documents and have been researching various options. It looks like LingPipe can cluster plain text without prior conversion (to vector space etc), but it's the only tool I've seen that explicitly claims to work on strings.
Are there any Python tools that can cluster text directly? If not, what's the best way to handle this?
NLTK is a Python library that supports linguistic analysis, including text clustering.
It seems possible to use simple UNIX command-line tools to extract the text content of those documents into text files, and then use a pure Python solution for the actual clustering.
I found a code snippet for clustering data in general:
http://www.daniweb.com/code/snippet216641.html
A Python package for this:
http://python-cluster.sourceforge.net/
Another Python package (used mainly for bioinformatics):
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm#pycluster
The quality of text-clustering depends mainly on two factors:
Some notion of similarity between the documents you want to cluster. For example, it's easy to distinguish news articles about sports from those about politics in vector space via tf-idf cosine distance. It's a lot harder to cluster product reviews into "good" and "bad" based on this measure.
The clustering method itself. Do you know how many clusters there will be? Then use k-means. You don't care about accuracy but want to show a nice tree structure for navigating search results? Use some kind of hierarchical clustering.
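To make the first factor concrete, here is a minimal pure-Python sketch of tf-idf vectors compared by cosine similarity. The function names (`tfidf_vectors`, `cosine_similarity`), the whitespace tokenizer, and the sample sentences are assumptions made for illustration, not part of any library mentioned above.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized tf-idf vector (dict: token -> weight) per document."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: (count / len(tokens)) * math.log(n_docs / df[t])
               for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a plain dot product)."""
    return sum(w * b[t] for t, w in a.items() if t in b)


# Made-up sample documents: two about sports, two about politics.
docs = [
    "the team won the football match",
    "the striker scored in the football match",
    "the parliament passed the budget vote",
    "the minister lost the budget vote",
]
vectors = tfidf_vectors(docs)
same_topic = cosine_similarity(vectors[0], vectors[1])
cross_topic = cosine_similarity(vectors[0], vectors[2])
```

With these sentences, the two sports documents share topical words and score higher than a sports/politics pair, whose only common token ("the") gets an idf of zero. Sentiment words in product reviews would not separate this cleanly, which is the point made above.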
There is no text-clustering solution that works well under all circumstances, so it's probably not enough to take some clustering software out of the box and throw your data at it.
Having said that, here's some experimental code I used some time ago to play around with text clustering. The documents are represented as normalized tf-idf vectors, and the similarity is measured as cosine distance. The clustering method itself is MajorClust.
For real applications, you would use a decent tokenizer, use integers instead of token strings, and avoid calculating an O(n^2) distance matrix...
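As a rough illustration of the MajorClust idea (this is a simplified sketch, not the experimental code referred to above): every document starts in its own cluster and repeatedly joins the cluster to which its summed similarity is highest, until nothing changes. The function name `majorclust`, the `max_rounds` cap, the tie-breaking rule, and the hand-written similarity matrix are all assumptions. In practice the matrix would hold pairwise cosine similarities of tf-idf vectors.

```python
from collections import defaultdict


def majorclust(sim, max_rounds=100):
    """Simplified MajorClust-style clustering.

    sim is a symmetric n x n matrix (list of lists) of pairwise similarities.
    Returns a list with one cluster label per item.
    """
    n = len(sim)
    labels = list(range(n))  # every item starts in its own cluster
    for _ in range(max_rounds):  # cap the rounds in case labels oscillate
        changed = False
        for i in range(n):
            # Sum the similarity between item i and each neighbouring cluster.
            attraction = defaultdict(float)
            for j in range(n):
                if j != i and sim[i][j] > 0:
                    attraction[labels[j]] += sim[i][j]
            if not attraction:
                continue  # an isolated item keeps its own cluster
            # Join the most attractive cluster; on a tie, prefer the current one.
            best = max(attraction, key=lambda c: (attraction[c], c == labels[i]))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


# Hand-written similarities (assumption): items 0/1 and items 2/3 are close.
example_sim = [
    [0.0, 0.8, 0.1, 0.0],
    [0.8, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.7],
    [0.0, 0.1, 0.7, 0.0],
]
example_labels = majorclust(example_sim)
```

Unlike k-means, this needs no cluster count up front, but it still reads the full O(n^2) matrix, which is exactly the cost to avoid in a real application.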