I am looking at working on an NLP project, in any language (though Python will be my preference).
I want to write a program that will take two documents and determine how similar they are.
I am fairly new to this, and a quick Google search didn't point me to much. Do you know of any references (websites, textbooks, journal articles) that cover this subject and would be of help to me?
Thanks
The common way of doing this is to transform the documents into tf-idf vectors, then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.
Tf-idf (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]  # read each file into a string
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since the vectorizer returns L2-normalized tf-idf rows
pairwise_similarity = tfidf * tfidf.T
or, if the documents are plain strings,
>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).A
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
though Gensim may have more options for this kind of task.
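If you want to read a concrete result off that pairwise matrix, say which document is most similar to the first one, one option (a small NumPy follow-up, not part of the original snippet, reusing pairwise_similarity from the first block) is to blank out the diagonal and take the row-wise argmax:

import numpy as np

arr = pairwise_similarity.toarray()            # dense (n_docs, n_docs) similarity matrix
np.fill_diagonal(arr, np.nan)                  # ignore each document's similarity to itself
closest_to_first = int(np.nanargmax(arr[0]))   # index of the document most similar to doc 0
print(closest_to_first, arr[0, closest_to_first])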
See also this question.
[Disclaimer: I was involved in the scikit-learn tf-idf implementation.]
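Since Gensim came up above: purely as a sketch (not from the answer itself, with naive whitespace tokenization standing in for real preprocessing), the same idea using Gensim's Dictionary / TfidfModel / MatrixSimilarity pipeline might look like this:

from gensim import corpora, models, similarities

docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
texts = [doc.lower().split() for doc in docs]            # token lists per document

dictionary = corpora.Dictionary(texts)                   # token -> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words vectors
tfidf = models.TfidfModel(bow_corpus)                    # tf-idf weighting
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

sims = index[tfidf[bow_corpus[0]]]                       # cosine similarity of doc 0 to every doc
print(list(enumerate(sims)))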
Identical to @larsman, but with some preprocessing
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')  # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]

print(cosine_sim('a little bird', 'a little bird'))         # identical texts -> 1.0
print(cosine_sim('a little bird', 'a little bird chirps'))  # partial overlap -> between 0 and 1
print(cosine_sim('a little bird', 'a big dog barks'))       # no shared terms after stop-word removal -> 0.0
It's an old question, but I found this can be done easily with spaCy. Once the documents have been processed, the similarity() method can be used to find the cosine similarity between the document vectors.
import spacy

nlp = spacy.load('en')  # in spaCy v3+, load a named model instead, e.g. spacy.load('en_core_web_md')

doc1 = nlp('Hello hi there!')
doc2 = nlp('Hello hi there!')
doc3 = nlp('Hey whatsup?')

print(doc1.similarity(doc2))  # 0.999999954642
print(doc2.similarity(doc3))  # 0.699032527716
print(doc1.similarity(doc3))  # 0.699032527716
Generally, the cosine similarity between two documents is used as a similarity measure of documents. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept is to count the terms in every document and calculate the dot product of the term vectors. The libraries provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something more complex, LingPipe also provides methods to calculate LSA similarity between documents, which gives better results than cosine similarity.
For Python, you can use NLTK.
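To make the count-and-dot-product idea above concrete, here is a minimal pure-Python sketch (standard library only, naive whitespace tokenization, raw counts with no idf weighting):

import math
from collections import Counter

def cosine_similarity(text1, text2):
    """Cosine of the angle between raw term-count vectors."""
    vec1 = Counter(text1.lower().split())
    vec2 = Counter(text2.lower().split())
    dot = sum(vec1[term] * vec2[term] for term in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_similarity("the quick brown fox", "the quick brown fox jumps"))
print(cosine_similarity("the quick brown fox", "an entirely different sentence"))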
Here's a little app to get you started...

import difflib as dl

a = open('file').read()
b = open('file1').read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

# count the words in a that have a close match somewhere in b
for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print('%d%% similarity' % int(n * 100))
You might want to try this online service for cosine document similarity http://www.scurtu.it/documentSimilarity.html
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "http://www.scurtu.it/apis/documentSimilarity"

inputDict = {}
inputDict['doc1'] = 'Document with some text'
inputDict['doc2'] = 'Other document with some text'

params = urlencode(inputDict).encode('utf-8')  # POST body must be bytes in Python 3
f = urlopen(API_URL, params)
response = f.read()
responseObject = json.loads(response)
print(responseObject)
If you are more interested in measuring the semantic similarity of two pieces of text, I suggest taking a look at this gitlab project. You can run it as a server, and there is also a pre-built model which you can easily use to measure the similarity of two pieces of text; even though it is mostly trained for measuring the similarity of two sentences, you can still use it in your case. It is written in Java, but you can run it as a RESTful service.
Another option is DKPro Similarity, which is a library with various algorithms for measuring the similarity of texts. However, it is also written in Java.