Creating ARPA language model file with 50,000 word

2019-03-08 21:56发布

问题:

I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is any other link available where I can get a language model for these many words?

回答1:

I thought I'd answer this one since it has a few votes, although based on Christina's other questions I don't think this will be a usable answer for her since a 50,000-word language model almost certainly won't have an acceptable word error rate or recognition speed (or most likely even function for long) with in-app recognition systems for iOS that use this format of language model currently, due to hardware constraints. I figured it was worth documenting it because I think it may be helpful to others who are using a platform where keeping a vocabulary this size in memory is more of a viable thing, and maybe it will be a possibility for future device models as well.

There is no web-based tool I'm aware of like the Sphinx Knowledge Base Tool that will munge a 50,000-word plaintext corpus and return an ARPA language model. But, you can obtain an already-complete 64,000-word DMP language model (which can be used with Sphinx at the command line or in other platform implementations in the same way as an ARPA .lm file) with the following steps:

  1. Download this language model from the CMU speech site:

http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20HUB4%20Language%20Model/HUB4_trigram_lm.zip

In that folder is a file called language_model.arpaformat.DMP which will be your language model.

  1. Download this file from the CMU speech site, which will become your pronunciation dictionary:

https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

Convert the contents of cmu07a.dic to all uppercase letters.

If you want, you could also trim down the pronunciation dictionary by removing any words from it which aren't found in the corpus language_model.vocabulary (this would be a regex problem). These files are intended for use with one of the Sphinx English-language acoustic models.

If the desire to use a 50,000-word English language model is driven by the idea of doing some kind of generalized large vocabulary speech recognition and not by the need to use a very specific 50,000 words (for instance, something specialized like a medical dictionary or 50,000-entry contact list), this approach should give those results if the hardware can handle it. There are probably going to be some Sphinx or Pocketsphinx settings that will need to be changed which will optimize searches through this size of model.



回答2:

How big is your training corpus? if it's only 50000 words, that would be tiny / too small.

In general, you could either use the toolkit from CMU or HTK

Detailed documentation for the HTK Speech Recognition Toolkit here: http://htk.eng.cam.ac.uk/ftp/software/htkbook_html.tar.gz

Here's also a description of CMU's SLM Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html

see also: Building openears compatible language model

You could take a more general Language Model, based on a bigger corpus and interpolate your smaller Language Model with it .. e.g a back-off language model ... but that's not a trivial task.

see: http://en.wikipedia.org/wiki/Katz's_back-off_model