Ruby Text Analysis

Posted 2019-02-04 21:11

Question:

Is there any Ruby gem or other tool for text analysis? Word frequency, pattern detection and so forth (preferably with an understanding of French).

Answer 1:

The generalization of word frequencies is the language model, e.g. uni-grams (= single-word frequencies), bi-grams (= frequencies of word pairs), tri-grams (= frequencies of word triples), ..., in general: n-grams.
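To make the n-gram idea concrete, here is a minimal Ruby sketch that counts n-grams of any order. The helper names `tokenize` and `ngram_counts` are made up for illustration, and the tokenizer (lowercase, then runs of Unicode letters) is a simplifying assumption, not a proper French tokenizer.

```ruby
# Split text into lowercase word tokens.
# Assumption: a "word" is any run of Unicode letters, so accented
# French characters are kept and punctuation is dropped.
def tokenize(text)
  text.downcase.scan(/\p{L}+/)
end

# Count every run of n consecutive tokens.
# Returns a hash mapping an array of n words to its frequency.
def ngram_counts(tokens, n)
  tokens.each_cons(n).each_with_object(Hash.new(0)) do |gram, counts|
    counts[gram] += 1
  end
end

tokens = tokenize("Le chat dort et le chat mange")
unigrams = ngram_counts(tokens, 1)
bigrams  = ngram_counts(tokens, 2)

puts unigrams[["chat"]]       # frequency of the single word "chat"
puts bigrams[["le", "chat"]]  # frequency of the pair "le chat"
```

This is fine for small texts; for large corpora the toolkits mentioned in this answer are a better fit.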

You should look for an existing toolkit for Language Models - not a good idea to re-invent the wheel here.

There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.

These toolkits are typically written in C (for speed, because you have to process huge corpora) and generate ARPA n-gram files as their standard output format (ARPA is a plain-text format).

Check the following thread, which contains more details and links:

Building openears compatible language model

Once you have generated your language model with one of these toolkits, you will need either a Ruby gem that makes the language model accessible in Ruby, or you will need to convert the ARPA format into your own format.
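If you go the conversion route, reading an ARPA file into a Ruby hash is not much code. The sketch below is only an illustration: `parse_arpa` is a hypothetical helper, and it assumes the usual layout (a `\data\` header, `\N-grams:` sections, lines of the form log10 probability, words, optional backoff weight, and a closing `\end\`) with whitespace-separated fields. The sample model is invented for the example.

```ruby
# Parse ARPA-format text into
#   { order => { [word, ...] => { logprob: Float, backoff: Float or nil } } }
# Assumption: fields are separated by tabs or spaces and words contain no whitespace.
def parse_arpa(text)
  model = Hash.new { |hash, key| hash[key] = {} }
  order = nil
  text.each_line do |raw|
    line = raw.strip
    next if line.empty?
    if (m = line.match(/\A\\(\d+)-grams:\z/))
      order = m[1].to_i              # entering a new n-gram section
    elsif line == '\\end\\'
      break                          # end-of-model marker
    elsif order
      fields  = line.split(/\s+/)
      logprob = Float(fields.shift)  # first field: log10 probability
      words   = fields.shift(order)  # next `order` fields: the n-gram itself
      backoff = fields.empty? ? nil : Float(fields.first)
      model[order][words] = { logprob: logprob, backoff: backoff }
    end
    # Header lines (\data\, "ngram 1=3") are skipped because order is still nil.
  end
  model
end

# Tiny invented model, for illustration only.
sample = <<'ARPA'
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0 <s> -0.5
-0.8 bonjour -0.3
-1.2 </s>

\2-grams:
-0.4 <s> bonjour
-0.6 bonjour </s>

\end\
ARPA

model = parse_arpa(sample)
puts model[1][["bonjour"]][:logprob]
puts model[2][["<s>", "bonjour"]][:logprob]
```

A nested hash keyed by word arrays keeps lookups simple; for a real model with millions of entries you would probably want a more compact structure.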

adi92's post lists some more Ruby NLP resources.

You can also Google for "ARPA Language Model" for more info.

Last but not least, check Google's N-gram tool online: http://ngrams.googlelabs.com/ soon to move to: http://books.google.com/ngrams

They built n-grams based on the books they digitized -- also available in French and other languages!



Answer 2:

http://mendicantbug.com/2009/09/13/nlp-resources-for-ruby/ contains lots of useful Ruby NLP links.
I tried using the Ruby Linguistics library a long time ago and remember having a lot of problems with it... I don't recommend jumping into that.
If most of your text analysis involves things like counting n-grams and naive Bayes, I recommend just doing it on your own. Ruby has pretty good basic libraries and awesome support for regexes, so this should not be that tricky, and it will be easier for you to adapt things to the idiosyncrasies of the problem you are trying to solve.
As with the Stanford parser gem, it's possible to use Java libraries that solve your problem from within Ruby, but this can be tricky, so it's probably not the best approach.
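As a rough illustration of the do-it-yourself route, plain Ruby with a Unicode-aware regex is enough for a word-frequency table. The method names and the `min_length` filter are my own choices for this sketch, and `\p{L}+` is a deliberately naive notion of a word (it splits "l'été" into "l" and "été").

```ruby
# Count how often each word occurs.
# Assumption: words are runs of Unicode letters; min_length is a crude
# way to drop very short tokens such as articles.
def word_frequencies(text, min_length: 1)
  text.downcase.scan(/\p{L}+/)
      .select { |word| word.length >= min_length }
      .each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }
end

# Return the `limit` most frequent words as [word, count] pairs,
# breaking ties alphabetically (by byte order).
def top_words(text, limit: 3, min_length: 1)
  word_frequencies(text, min_length: min_length)
    .sort_by { |word, count| [-count, word] }
    .first(limit)
end

text = "Le café est fermé. Le café ouvre demain, le matin."
p top_words(text, limit: 2)  # => [["le", 3], ["café", 2]]
```

From here it is a small step to add a stop-word list or stemming for French, depending on what the problem needs.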



Answer 3:

I wrote the gem words_counted for this reason. You can see a demo on rubywordcount.com. It has a lot of the analysis features you mention, and a host more. The API is well documented and can be found in the readme on GitHub.