I am looking at working on an NLP project, in any language (though Python will be my preference).
I want to write a program that will take two documents and determine how similar they are.
I am fairly new to this, and a quick Google search did not turn up much. Do you know of any references (websites, textbooks, journal articles) which cover this subject and would be of help to me?
Thanks
The common way of doing this is to transform the documents into tf-idf vectors, then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.
Tf-idf (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities takes only a few lines, whether the documents are read from files or are already plain strings, though Gensim may have more options for this kind of task.
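As a minimal sketch of the tf-idf plus cosine similarity approach in scikit-learn, assuming the documents are already plain strings (the example sentences below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example documents; any list of plain strings works.
documents = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to orange",
]

# TfidfVectorizer L2-normalizes each row by default (norm="l2"),
# so the dot product of two rows is their cosine similarity.
tfidf = TfidfVectorizer().fit_transform(documents)

# Sparse matrix product: entry (i, j) is the similarity of documents i and j.
pairwise_similarity = (tfidf @ tfidf.T).toarray()

print(pairwise_similarity.round(3))
```

The diagonal is 1 (each document compared with itself), and documents that share no terms score 0. For documents stored in files, reading each file into a string first and passing the resulting list to `fit_transform` should work the same way.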
See also this question.
[Disclaimer: I was involved in the scikit-learn tf-idf implementation.]
Generally, the cosine similarity between two documents is used as the similarity measure. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept is to count the terms in every document and calculate the dot product of the term vectors, normalized by the vector lengths. The libraries provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something more complex, LingPipe also provides methods to calculate LSA similarity between documents, which can give better results than plain cosine similarity. For Python, you can use NLTK.
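To make the basic concept concrete, here is a small sketch in plain Python (no libraries) that counts terms and computes the cosine similarity of the resulting term vectors. The tokenizer is deliberately crude, and the function names `term_counts` and `cosine_similarity` are just illustrative choices:

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase word tokens in a document (a very crude tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = term_counts("the cat sat on the mat")
doc2 = term_counts("the cat lay on the rug")
doc3 = term_counts("stock prices fell sharply")

print(cosine_similarity(doc1, doc2))  # roughly 0.75
print(cosine_similarity(doc1, doc3))  # 0.0, no terms in common
```

Raw counts let frequent words such as "the" dominate the score, which is the problem that inverse document frequency weighting is meant to address.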
You might want to try this online service for cosine document similarity: http://www.scurtu.it/documentSimilarity.html
It's an old question, but I found this can be done easily with spaCy. Once the document is read, a simple API call, similarity, can be used to find the cosine similarity between the document vectors.
Identical to @larsman, but with some preprocessing
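One way the tf-idf approach with some preprocessing might look is sketched below. The specific steps (lowercasing, stripping digits and punctuation, dropping English stop words) and the names `preprocess` and `document_similarity` are assumptions for illustration, not necessarily what this answer used:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def preprocess(text):
    """Lowercase the text and replace every non-letter, non-space character with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def document_similarity(text_a, text_b):
    """Cosine similarity of two strings under a tf-idf model fitted on just those two."""
    vectorizer = TfidfVectorizer(preprocessor=preprocess, stop_words="english")
    tfidf = vectorizer.fit_transform([text_a, text_b])
    # Rows are L2-normalized, so the off-diagonal entry is the cosine similarity.
    return float((tfidf @ tfidf.T)[0, 1])


print(document_similarity("An apple a day keeps the doctor away!",
                          "The doctor ate an APPLE, 24/7."))
print(document_similarity("Apple, doctor!", "apple doctor"))
```

Fitting the idf weights on only two documents is a simplification; with a larger collection, fitting the vectorizer on the whole corpus and then comparing rows would usually give more meaningful weights.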
Here's a little app to get you started...