Similarity between two text documents

Posted 2019-01-01 04:50

I am looking at working on an NLP project, in any language (though Python will be my preference).

I want to write a program that will take two documents and determine how similar they are.

As I am fairly new to this, a quick Google search did not point me to much. Do you know of any references (websites, textbooks, journal articles) that cover this subject and would be of help to me?

Thanks

Tags: nlp
7 Answers
怪性笑人.
#2 · 2019-01-01 04:54

The common way of doing this is to transform the documents into tf-idf vectors, then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.

Tf-idf (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]  # read each file into a string
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

or, if the documents are plain strings,

>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).A
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])

though Gensim may have more options for this kind of task.
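
For comparison, a minimal Gensim sketch of the same idea might look as follows (assuming Gensim 4.x; the naive whitespace tokenization and variable names are just illustrative):

from gensim import corpora, models, similarities

docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
texts = [doc.lower().split() for doc in docs]  # naive tokenization

dictionary = corpora.Dictionary(texts)                 # term -> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                      # fit tf-idf weights

# Index the tf-idf corpus; querying it returns cosine similarities.
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
print(index[tfidf[corpus[0]]])  # similarities of document 0 to all four documents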

See also this question.

[Disclaimer: I was involved in the scikit-learn tf-idf implementation.]

裙下三千臣
#3 · 2019-01-01 04:58

Generally, the cosine similarity between two documents is used as a similarity measure for documents. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept is to count the terms in every document and calculate the dot product of the term vectors. The libraries do provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something more complex, LingPipe also provides methods to calculate LSA similarity between documents, which gives better results than cosine similarity. For Python, you can use NLTK.
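
To make the dot-product idea concrete, here is a minimal sketch in plain Python using raw term counts (no idf weighting; the function name and toy inputs are just illustrative):

import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    '''Cosine similarity between two documents using raw term counts.'''
    # Count the terms in each document to build the term vectors.
    vec1, vec2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    # Dot product over the shared vocabulary.
    dot = sum(vec1[term] * vec2[term] for term in vec1.keys() & vec2.keys())
    # Normalize by the Euclidean lengths of the two vectors.
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_similarity('the cat sat on the mat', 'the cat sat'))  # ~0.82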

孤独总比滥情好
#4 · 2019-01-01 04:59

You might want to try this online service for cosine document similarity http://www.scurtu.it/documentSimilarity.html

import json
import urllib.parse
import urllib.request

API_URL = "http://www.scurtu.it/apis/documentSimilarity"
inputDict = {'doc1': 'Document with some text',
             'doc2': 'Other document with some text'}

# POST the two documents to the API and parse the JSON response.
params = urllib.parse.urlencode(inputDict).encode('utf-8')
with urllib.request.urlopen(API_URL, params) as f:
    response = f.read()
responseObject = json.loads(response)
print(responseObject)
其实,你不懂
#5 · 2019-01-01 05:02

It's an old question, but I found this can be done easily with spaCy. Once the documents are parsed, the similarity method can be used to find the cosine similarity between the document vectors.

import spacy

# Load a model that ships with word vectors; newer spaCy versions no longer
# accept the bare 'en' shortcut, and exact scores depend on the model used.
nlp = spacy.load('en_core_web_md')
doc1 = nlp('Hello hi there!')
doc2 = nlp('Hello hi there!')
doc3 = nlp('Hey whatsup?')

print(doc1.similarity(doc2))  # ~1.0 (identical texts)
print(doc2.similarity(doc3))  # ~0.7
print(doc1.similarity(doc3))  # ~0.7
栀子花@的思念
#6 · 2019-01-01 05:13

Identical to @larsman's answer, but with some preprocessing:

import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt') # if necessary...


stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''Remove punctuation, lowercase, and stem.'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0,1]


print(cosine_sim('a little bird', 'a little bird'))
print(cosine_sim('a little bird', 'a little bird chirps'))
print(cosine_sim('a little bird', 'a big dog barks'))
大哥的爱人
#7 · 2019-01-01 05:15

Here's a little app to get you started...

import difflib as dl

a = open('file').read()
b = open('file1').read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

# Count the words in the first document that have a close match in the second.
for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print('%d%% similarity' % int(n * 100))