Calculating tf-idf among documents using python 2.

2019-01-27 00:34发布

问题:

I have a scenario where i have retreived information/raw data from the internet and placed them into their respective json or .txt files.

From there on i would like to calculate the frequecies of each term in each document and their cosine similarity by using tf-idf.

For example: there are 50 different documents/texts files that consists 5000 words/strings each i would like to take the first word from the first document/text and compare all the total 250000 words find its frequencies then do so for the second word and so on for all 50 documents/texts.

Expected output of each frequecy will be from 0 -1

How am i able to do so. I have been referring to sklear package but most of them only consists of a few strings in each comparisons.

回答1:

You really should show us your code and explain in more detail which part it is that you are having trouble with.

What you describe is not usually how it's done. What you usually do is vectorize documents, then compare the vectors, which yields the similarity between any two documents under this model. Since you are asking about NLTK, I will proceed on the assumption that you want this regular, traditional method.

Anyway, with a traditional word representation, cosine similarity between two words is meaningless -- either two words are identical, or they're not. But there are certainly other ways you could approach term similarity or document similarity.

Copying the code from https://stackoverflow.com/a/23796566/874188 so we have a baseline:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer._tfidf.idf_
print dict(zip(vectorizer.get_feature_names(), idf))

There is nothing here which depends on the length of the input. The number of features in idf will be larger if you have longer documents and there will be more of them in the corpus if you have more documents, but the algorithm as such will not need to change at all to accommodate more or longer documents.

If you don't want to understand why, you can stop reading here.

The vectors are basically an array of counts for each word form. The length of each vector is the number of word forms (i.e. the number of features). So if you have a lexicon with six entries like this:

0: a
1: aardvark
2: banana
3: fruit
4: flies
5: like

then the input document "a fruit flies like a banana" will yield a vector of six elements like this:

[2, 0, 1, 1, 1, 1]

because there are two occurrences of the word at index zero in the lexicon, zero occurrences of the word at index one, one of the one at index two, etc. This is a TF (term frequency) vector. It is already a useful vector; you can compare two of them using cosine distance, and obtain a measurement of their similarity.

The purpose of the IDF factor is to normalize this. The normalization brings three benefits; computationally, you don't need to do any per-document or per-comparison normalization, so it's faster; and the algorithm also normalizes frequent words so that many occurrences of "a" is properly regarded as insignificant if most documents contain many occurrences of this word (so you don't have to do explicit stop word filtering), whereas many occurrences of "aardvark" is immediately obviously significant in the normalized vector. Also, the normalized output can be readily interpreted, whereas with plain TF vectors you have to take document length etc. into account to properly understand the result of the cosine similarity comparison.

So if the DF (document frequency) of "a" is 1000, and the DF of the other words in the lexicon is 1, the scaled vector will be

[0.002, 0, 1, 1, 1, 1]

(because we take the inverse of the document frequency, i.e. TF("a")*IDF("a") = TF("a")/DF("a") = 2/1000).

The cosine similarity basically interprets these vectors in an n-dimensional space (here, n=6) and sees how far from each other their arrows are. Just for simplicity, let's scale this down to three dimensions, and plot the (IDF-scaled) number of "a" on the X axis, the number of "aardvark" occurrences on the Y axis, and the number of "banana" occurrences on the Z axis. The end point [0.002, 0, 1] differs from [0.003, 0, 1] by just a tiny bit, whereas [0, 1, 0] ends up at quite another corner of the cube we are imagining, so the cosine distance is large. (The normalization means 1.0 is the maximum of any element, so we are talking literally a corner.)

Now, returning to the lexicon, if you add a new document and it has words which are not already in the lexicon, they will be added to the lexicon, and so the vectors will need to be longer from now on. (Vectors you already created which are now too short can be trivially extended; the term weight for the hitherto unseen terms will obviously always be zero.) If you add the document to the corpus, there will be one more vector in the corpus to compare against. But the algorithm doesn't need to change; it will always create vectors with one element per lexicon entry, and you can continue to compare these vectors using the same methods as before.

You can of course loop over the terms and for each, synthesize a "document" consisting of just that single term. Comparing it to other single-term "documents" will yield 0.0 similarity to the others (or 1.0 similarity to a document containing the same term and nothing else), so that's not too useful, but a comparison against real-world documents will reveal essentially what proportion of each document consists of the term you are examining.

The raw IDF vector tells you the relative frequency of each term. It usually expresses how many documents each term occurred in (so even if a term occurs more than once in a document, it only adds 1 to the DF for this term), though some implementations also allow you to use the bare term count.