I am parsing sentences. I want to know the relevant content of each sentence, defined loosely as "semi-unique words" in relation to the rest of the corpus. Something similar to Amazon's "statistically improbable phrases", which seem to (often) convey the character of a book through oddball strings of words.
My first pass was to start making a common-words list. This knocks out the easy ones like "a", "the", "from", etc. Obviously, it turns out that this list gets quite long.
One idea is to generate this list automatically: make a histogram of the corpus's word frequencies and lop off the top 10% or something similar (i.e. "the" occurs 700 times, "from" 600 times, but "micropayments" only 50, which is under the cutoff and therefore relevant).
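For concreteness, here is a minimal sketch of that cutoff idea (plain Python; corpus_tokens is a hypothetical, already-tokenized word list, and "top 10%" is taken to mean the top 10% of the vocabulary by frequency):

```python
from collections import Counter

def common_word_list(corpus_tokens, top_fraction=0.10):
    """Return the most frequent words, to be used as a stop list.

    corpus_tokens: iterable of lowercased word tokens (hypothetical input).
    top_fraction: fraction of the vocabulary to lop off, e.g. the top 10%.
    """
    counts = Counter(corpus_tokens)
    vocab = [word for word, _ in counts.most_common()]  # most frequent first
    cutoff = max(1, int(len(vocab) * top_fraction))
    return set(vocab[:cutoff])

# Any word not in this set ("micropayments", say) would count as relevant.
```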
Another algorithm I just learned about from Hacker News today is TF-IDF, which looks like it could be helpful.
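For reference, here is a rough, self-contained sketch of the TF-IDF idea as I understand it (one common tf × idf variant; the toy documents below are made up):

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: dict mapping a document name to a list of word tokens.
    Returns {doc_name: {word: score}} using a common tf * idf variant."""
    doc_freq = Counter()                 # number of documents containing each word
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    n_docs = len(documents)

    scores = {}
    for name, tokens in documents.items():
        tf = Counter(tokens)
        scores[name] = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        }
    return scores

# Words with high scores are frequent in one document but rare elsewhere.
docs = {"a": "the cat sat on the mat".split(),
        "b": "the dog chased the cat".split(),
        "c": "micropayments change the economics of content".split()}
print(sorted(tf_idf(docs)["c"].items(), key=lambda kv: -kv[1])[:3])
```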
What other approaches would work better than my two ideas?
Take a look at this article (Level statistics of words: Finding keywords in literary texts and symbolic sequences, published in Phys. Rev. E).
The picture on the first page, together with its caption, explains the crucial observation: in Don Quixote, the words "but" and "Quixote" appear with similar frequencies, but their spectra are quite different (occurrences of "Quixote" are clustered, while occurrences of "but" are more evenly spaced). Therefore, "Quixote" can be classified as an interesting word (a keyword) while "but" is ignored.
It might or might not be what you're looking for, but I guess it won't hurt to be familiar with this result.
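I haven't reproduced the paper's exact statistic, but a rough sketch of the underlying idea (measuring how clustered a word's occurrence positions are, compared with a word spread evenly through the text) could look like the following; the burstiness measure is my own simplification, not the paper's formula:

```python
import statistics

def burstiness(positions):
    """Rough clustering measure for one word: coefficient of variation of the
    gaps between successive occurrences. Evenly spaced words score near 0;
    strongly clustered words score higher. (Only the spirit of the paper's
    level-statistics measure, not its exact formula.)"""
    if len(positions) < 3:
        return 0.0
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean_gap = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean_gap if mean_gap else 0.0

def keyword_scores(tokens):
    """tokens: list of word tokens in document order."""
    positions = {}
    for i, word in enumerate(tokens):
        positions.setdefault(word, []).append(i)
    return {word: burstiness(p) for word, p in positions.items()}
```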
I think what Amazon calls "Statistically Improbable Phrases" are words that are improbable with respect to their huge corpus of data. In effect, even if a word is repeated 1000 times in a given book A, if that book is the only place where it appears, then it is a SIP, because the probability of it appearing in any given book is zilch (because it is specific to book A). You can't really replicate this wealth of data to compare against unless you are working with lots of data yourself.
What is lots of data? Well, if you are analyzing literary texts, then you would want to download and process a couple thousand books from Project Gutenberg. But if you are analyzing legal texts, then you'd have to specifically feed in the content of legal books.
If, as is probably the case, you don't have the luxury of lots of data, then you have to rely, one way or another, on frequency analysis. But instead of considering relative frequencies (fractions of the text, as is often done), consider absolute frequencies.
For instance, hapax legomena, also known in the network-analysis domain as 1-mice, could be of particular interest. They are words that appear only once in a given text. For instance, in James Joyce's Ulysses, these words appear only once: postexilic, corrosive, romanys, macrocosm, diaconal, compressibility, aungier. They are not statistically improbable phrases (as "Leopold Bloom" would be), so they don't characterize the book. But they are terms rare enough that they appear only once in this writer's expression, so you can consider that they characterize, in a way, his expression. They are words that, unlike common words like "the", "color", or "bad", he expressly sought to use.
So these are an interesting artifact, and the thing is, they are pretty easy to extract (think O(N) with constant memory), unlike other, more complex, indicators. (And if you want elements which are slightly more frequent, then you can turn to 2-mice, ..., 10-mice which are similarly easy to extract.)
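If it helps, a minimal sketch of that extraction (plain Python; ulysses_tokens in the usage comment is a hypothetical tokenized text):

```python
from collections import Counter

def k_mice(tokens, k=1):
    """Return the words that appear exactly k times in the token stream.
    k=1 gives the hapax legomena; k=2 the 2-mice, and so on."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == k)

# Usage: k_mice(ulysses_tokens) would return words such as "postexilic" or
# "diaconal" (assuming ulysses_tokens is your own tokenization of the novel).
```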
TF-IDF is one way to go. If you want to talk about sentences rather than words, in addition to the excellent references above, here's a simple scheme:
Create a Markov chain from a large sample corpus. In a nutshell, you construct a Markov chain by recording the frequency of every n-tuple in your input text. For example, the sentence "this is a test" with 3-tuples would be (this, is, a), (is, a, test). Then, you group every n-tuple by its first n-1 terms, allowing you to answer the question "given the preceding n-1 words, what is the probability of the next word being this one?"
Now, for every sentence in the input document, traverse the Markov chain. Calculate the probability of seeing the sentence by multiplying together all the probabilities you encounter while traversing the chain. This gives you an estimate of how 'probable' the sentence is in the input corpus. You may want to multiply this probability by the length of the sentence, since longer sentences are statistically less likely.
Now you have associated with each sentence in your input a probability. Pick the n least probable sentences - these are the 'interesting' ones, for some definition of interesting.
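A rough sketch of that scheme with 3-tuples (trigrams), assuming the corpus and the input document have already been split into sentences and word tokens; the smoothing floor for unseen transitions is an arbitrary choice of mine:

```python
from collections import Counter, defaultdict

def train_chain(corpus_sentences, n=3):
    """For each (n-1)-word context, count how often each next word follows it.
    corpus_sentences: list of token lists from the large sample corpus."""
    context_counts = defaultdict(Counter)
    for tokens in corpus_sentences:
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            context_counts[tuple(context)][nxt] += 1
    return context_counts

def sentence_probability(tokens, context_counts, n=3, unseen=1e-6):
    """Multiply the conditional probability of each word given its n-1
    predecessors. 'unseen' is an arbitrary floor for transitions that
    never occur in the corpus."""
    prob = 1.0
    for i in range(len(tokens) - n + 1):
        *context, nxt = tokens[i:i + n]
        followers = context_counts.get(tuple(context))
        if followers:
            p = followers[nxt] / sum(followers.values())
            prob *= p if p > 0 else unseen
        else:
            prob *= unseen
    return prob

# Usage sketch: score every sentence and keep the least probable ones.
# chain = train_chain(corpus_sentences)
# interesting = sorted(input_sentences,
#                      key=lambda s: sentence_probability(s, chain))[:10]
```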
Latent Semantic Analysis (LSA)
http://www.puffinwarellc.com/index.php/news-and-articles/articles/33.html
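In case a concrete starting point is useful, here is a very rough LSA sketch with scikit-learn (my own illustration, not taken from the linked article; the toy documents are made up):

```python
# Rough LSA sketch: build a TF-IDF matrix, then apply a truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "micropayments change the economics of content"]   # toy documents

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)            # documents x terms matrix

svd = TruncatedSVD(n_components=2)       # project onto 2 latent dimensions
doc_topics = svd.fit_transform(X)        # documents in the latent space
print(doc_topics)
```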