How do I extract keywords used in text? [closed]

Posted 2019-01-09 22:12

Question:

How do I data mine a pile of text to get keywords by usage? ("Jacob Smith" or "fence")

Is there software that already does this, even semi-automatically? If it can filter out simple words like "the", "and", and "or", I could get to the topics more quickly.

Answer 1:

The general algorithm is going to go like this:

- Obtain Text
- Strip punctuation, special characters, etc.
- Strip "simple" words
- Split on Spaces
- Loop Over Split Text
    - Add word to Array/HashTable/Etc if it doesn't exist;
       if it does, increment counter for that word

The end result is a frequency count of all words in the text. You can then divide each count by the total number of words to get a relative frequency. Any further processing is up to you.
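The steps above could be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the `STOP_WORDS` set and the `word_frequencies` helper are hypothetical names made up for this example, and the stop-word list is deliberately tiny. It also splits into words before dropping the "simple" ones, since that is easier to do on a word list.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is", "it"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Return (counts, relative_frequencies) for the non-stop words in text."""
    # Lowercase, then keep runs of letters (with an optional inner apostrophe).
    # This drops punctuation and splits on whitespace in one step.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    kept = [w for w in words if w not in stop_words]
    counts = Counter(kept)
    # Relative frequency is taken over the words that remain after filtering.
    total = sum(counts.values())
    rel = {w: c / total for w, c in counts.items()} if total else {}
    return counts, rel

counts, rel = word_frequencies(
    "The fence and the gate. Jacob Smith painted the fence."
)
for word, n in counts.most_common(3):
    print(word, n, f"{rel[word]:.0%}")
```

Note that this only counts single words. A name like "Jacob Smith" would come out as two separate entries, so multi-word keywords would need an extra step such as counting adjacent word pairs.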

You're also going to want to look into stemming. Stemming is used to reduce words to their root, for example going => go and cars => car.
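To make the idea concrete, here is a deliberately crude suffix-stripping sketch. It is a toy written for this example (the `naive_stem` function is hypothetical), not the Porter algorithm or any other established stemmer, and it will mangle plenty of words (for instance "running" becomes "runn"). In practice an existing stemmer, such as the `PorterStemmer` in NLTK, is likely a better choice.

```python
from collections import Counter

def naive_stem(word):
    """Crude suffix stripping; an illustration only, not a real stemmer."""
    word = word.lower()
    # Strip "-ing" only if at least two letters remain (going -> go).
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    # Strip a plural "s", but leave words ending in "ss" alone (glass).
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

# Stemming before counting merges inflected forms into one entry.
stems = Counter(naive_stem(w) for w in ["going", "go", "cars", "car"])
print(stems)
```

Applied before the counting loop, this makes "cars" and "car" contribute to the same counter instead of splitting the count between them.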

An algorithm like this is going to be common in spam filters, keyword indexing and the like.



Answer 2:

This is an open question in NLP, so there is no simple answer.

My recommendation for quick-and-dirty "works-for-me" is topia.termextract.

Yahoo has a keyword extraction service (http://developer.yahoo.com/search/content/V1/termExtraction.html) which is low recall but high precision. In other words, it gives you a small number of high quality terms, but misses many of the terms in your documents.

In Python, there is topia.termextract (http://pypi.python.org/pypi/topia.termextract/). It is relatively noisy and proposes many bogus keywords, but it is simple to use.

Termine (http://www.nactem.ac.uk/software/termine/) is a UK web service that is also relatively noisy and proposes many bogus keywords. However, it appears to me to be slightly more accurate than topia.termextract. YMMV.

One way to denoise results with too many keywords (e.g. from topia.termextract and Termine) is to create a vocabulary of terms that occur frequently, and then throw out proposed terms that are not in the vocabulary. In other words, do two passes over your corpus: in the first pass, count the frequency of each keyword; in the second pass, discard the keywords that are too rare.
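A rough sketch of that two-pass filter, assuming you already have one list of proposed keywords per document from whichever extractor you use. The `filter_rare_terms` function and its `min_docs` threshold are hypothetical names for this example, and counting the number of documents a term appears in (rather than raw occurrences) is one possible reading of "frequency" here.

```python
from collections import Counter

def filter_rare_terms(proposed_per_doc, min_docs=2):
    """proposed_per_doc: one list of proposed keywords per document."""
    # Pass 1: count in how many documents each proposed term appears.
    doc_freq = Counter()
    for terms in proposed_per_doc:
        doc_freq.update(set(terms))
    vocabulary = {t for t, n in doc_freq.items() if n >= min_docs}
    # Pass 2: keep only the proposed terms that made it into the vocabulary.
    return [[t for t in terms if t in vocabulary] for terms in proposed_per_doc]

# Made-up extractor output for three documents.
proposed = [
    ["fence", "jacob smith", "xqz"],
    ["fence", "gate"],
    ["jacob smith", "gate", "fence"],
]
print(filter_rare_terms(proposed))
```

Here "xqz" is proposed for only one document, so it is dropped, while terms seen in two or more documents survive. The right threshold depends on the size of the corpus.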

If you want to write your own, perhaps the best introduction is written by Park, who is now at IBM:

  • "Automatic glossary extraction: beyond terminology identification" available at http://portal.acm.org/citation.cfm?id=1072370
  • "Glossary extraction and utilization in the information search and delivery system for IBM technical support"

Here are some more references, if you want to learn more:

  • http://en.wikipedia.org/wiki/Terminology_extraction
  • "CorePhrase: Keyphrase Extraction for Document Clustering"
  • Liu et al 2009 from NAACL HLT
  • "Automatic Identification of Non-compositional Phrases"
  • "Data Mining Meets Collocations Discovery"
  • As well as a host of other references you can dig up on the subject.


Answer 3:

There is also a service called Alchemy that can do term-extraction, concept tagging, sentiment analysis and so on.

It works; I tested it, but I don't know their commercial policies (if any). They provide APIs for pretty much any language.

I read somewhere (sorry, I don't remember where anymore) that the output given by Alchemy is less noisy than that of the tools proposed by Joseph.



Answer 4:

You did not specify a technology you're working with, so I guess a shell script is also a possibility.

I've always been impressed by the word frequency analysis example in the Advanced Bash-Scripting Guide (12-11).

The following, for example, fetches a book from Project Gutenberg and writes out a word frequency analysis 'report':

wget http://www.gutenberg.org/files/20417/20417-8.txt -q -O- | 
sed -e 's/\.//g'  -e 's/\,//g' -e 's/ /\
/g' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr > output.txt

This should be extendable to exclude words from a 'common' list (the, and, a, ...).



Answer 5:

I personally recommend Maui (http://code.google.com/p/maui-indexer/): it relies on KEA but extends it in a variety of ways. It is trainable and can use RDF-formatted terminologies.



Answer 6:

I've used NLTK to recognize named entities before with some success. It is especially good at recognizing people's and organizations' names.