I'm looking for a Java library to extract keywords from a block of text.
The process should be as follows:
stop word cleaning -> stemming -> searching for keywords based on statistical information about the English language, meaning that if a word appears more often in the text than its probability in general English would predict, then it's a keyword candidate.
Is there a library that performs this task?
Here is a possible solution using Apache Lucene. I didn't use the latest version but 3.6.2, since this is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (especially the English one in your case).
Note that this will only find the frequencies of the input text words based on their respective stems. Comparing these frequencies with the English language statistics has to be done afterwards (this answer may help by the way).
The data model
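A minimal sketch of the model, using only the JDK. The class name Keyword, its accessors, and the ordering by descending frequency are assumptions made for illustration, not necessarily the original listing:

```java
import java.util.HashSet;
import java.util.Set;

/** One candidate keyword: a stem plus every word form that reduced to it. */
final class Keyword implements Comparable<Keyword> {
    private final String stem;
    private final Set<String> terms = new HashSet<>();
    private int frequency = 0;

    Keyword(String stem) {
        this.stem = stem;
    }

    /** Records one occurrence; the set silently ignores a term seen before. */
    void add(String term) {
        terms.add(term);
        frequency++;
    }

    String getStem() { return stem; }
    Set<String> getTerms() { return terms; }
    int getFrequency() { return frequency; }

    /** Orders the most frequent keywords first (assumed ordering). */
    @Override
    public int compareTo(Keyword other) {
        return Integer.compare(other.frequency, frequency);
    }

    /** Two keywords are equal when they share a stem, so a bare stem can act as a lookup probe. */
    @Override
    public boolean equals(Object obj) {
        return obj instanceof Keyword && ((Keyword) obj).stem.equals(stem);
    }

    @Override
    public int hashCode() {
        return stem.hashCode();
    }
}
```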
One keyword for one stem. Different words may have the same stem, hence the terms set. The keyword frequency is incremented every time a new term is found (even if it has already been found - a set automatically removes duplicates).
Utilities
To stem a word:
To search into a collection (will be used by the list of potential keywords):
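A possible shape for such a helper, as a sketch only. The class name KeywordCollections is hypothetical, and adding the example when nothing matches is an assumed convenience, so the caller always gets back an instance that is stored in the collection:

```java
import java.util.Collection;

final class KeywordCollections {
    private KeywordCollections() {}

    /**
     * Returns the element of the collection that equals the example.
     * If none is present, the example itself is added and returned
     * (assumed behaviour, see the note above).
     */
    static <T> T find(Collection<T> collection, T example) {
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        collection.add(example);
        return example;
    }
}
```

This relies on equals() comparing whatever identifies an element (the stem, for a keyword), so a freshly built probe object matches the stored one.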
Core
Here is the main input method:
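The counting logic of such a method can be sketched without any dependency. This is not the Lucene-based code: tokenizing is reduced to a regular expression, and the stop word set and the stemmer are passed in as parameters (both are assumptions of this sketch) where the approach described here relies on Lucene's English analyzer. The StemCounter and StemCount names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class StemCounter {
    /** Hypothetical result type: a stem, the words that produced it, and a count. */
    static final class StemCount {
        final String stem;
        final Set<String> terms = new LinkedHashSet<>();
        int frequency;

        StemCount(String stem) { this.stem = stem; }
    }

    private StemCounter() {}

    /**
     * Counts stems in the input and returns them, most frequent first.
     * stopWords and stemmer are injected placeholders for the work an
     * English analyzer would do.
     */
    static List<StemCount> guessFromString(String input,
                                           Set<String> stopWords,
                                           Function<String, String> stemmer) {
        Map<String, StemCount> byStem = new LinkedHashMap<>();
        // Lowercase, then split on anything that is not a letter, digit or apostrophe.
        for (String term : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            if (term.isEmpty() || stopWords.contains(term)) {
                continue;
            }
            String stem = stemmer.apply(term);
            if (stem == null || stem.isEmpty()) {
                continue; // the stemmer rejected this token
            }
            StemCount count = byStem.computeIfAbsent(stem, StemCount::new);
            count.terms.add(term);   // duplicates are ignored by the set
            count.frequency++;       // but every occurrence is counted
        }
        List<StemCount> result = new ArrayList<>(byStem.values());
        // Stable sort: stems with equal counts keep first-seen order.
        result.sort((a, b) -> Integer.compare(b.frequency, a.frequency));
        return result;
    }
}
```

Passing the stemmer as a function keeps the counting independent of the analysis library, so a Lucene-backed stemmer can be plugged in without touching the loop.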
Example
Using the guessFromString method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) that were found:
Iterate over the output list to see which original words were found for each stem by getting the terms sets (displayed between brackets [...] in the above example).
What's next
Compare the ratio of each stem's frequency to the sum of all frequencies with the corresponding English language statistics, and keep me in the loop if you manage it: I could be quite interested too :)
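One way that comparison could look, as a rough sketch. It assumes you already have a table of per-stem probabilities for general English (englishProbs, which this sketch does not provide), a fallback probability for stems missing from that table, and a threshold ratio; all names and parameters here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class KeywordCandidates {
    private KeywordCandidates() {}

    /**
     * Keeps the stems whose share of the text is at least minRatio times
     * their expected share in general English.
     *
     * @param stemCounts   occurrences of each stem in the text
     * @param englishProbs assumed reference table: probability of each stem in general English
     * @param unseenProb   assumed fallback probability for stems missing from the table
     * @param minRatio     how many times more frequent than expected a stem must be
     */
    static List<String> select(Map<String, Integer> stemCounts,
                               Map<String, Double> englishProbs,
                               double unseenProb,
                               double minRatio) {
        int total = 0;
        for (int count : stemCounts.values()) {
            total += count;
        }
        List<String> candidates = new ArrayList<>();
        if (total == 0) {
            return candidates; // empty text, nothing to compare
        }
        for (Map.Entry<String, Integer> entry : stemCounts.entrySet()) {
            // Stem frequency divided by the sum of all frequencies.
            double observed = entry.getValue() / (double) total;
            double expected = englishProbs.getOrDefault(entry.getKey(), unseenProb);
            if (expected > 0 && observed / expected >= minRatio) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }
}
```

The threshold and the fallback probability both need tuning against real data: a small unseenProb makes every rare stem look like a keyword, while a large one hides domain-specific terms.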
An updated and ready-to-use version of the code proposed above.
This code is compatible with Apache Lucene 5.x…6.x.
CardKeyword class:
KeywordsExtractor class:
The call of function: