Voice control API - high accuracy on specific phra

Posted 2020-07-17 07:11

Question:


I have several ideas for voice-controlled apps. Unfortunately, based on what I've seen from Siri and Google Voice Actions, the technology doesn't seem to be quite there yet. Even in a perfectly quiet environment, accuracy is so poor that it often feels much easier to just type into the phone.

One way to make the task easier would be to limit the system to a handful of commands, specifically chosen to sound very different from one another, instead of sending the audio to a service and just getting text back.
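As a rough illustration of picking commands that "sound very different", the sketch below flags pairs in a candidate command list that look too much alike. It assumes spelling similarity (via Python's standard `difflib`) as a crude stand-in for acoustic similarity, which it is not; the function name `confusable_pairs`, the 0.6 threshold, and the sample commands are all made up for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(commands, threshold=0.6):
    """Return (a, b, score) for command pairs whose spellings are similar.

    Assumption: spelling similarity is only a crude proxy for how alike
    two phrases sound; a real check would compare phoneme sequences.
    """
    pairs = []
    for a, b in combinations(commands, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    # Most similar (most risky) pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical candidate command set.
candidates = ["play", "stop", "light on", "light off"]
for a, b, score in confusable_pairs(candidates):
    print(f"{a!r} and {b!r} look confusable (similarity {score})")
```

Any pair reported here is a candidate for renaming, for example replacing "light on" / "light off" with two phrases that share fewer sounds.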

So the requirements I have are:

  • Very high accuracy when asked to work with a limited set of commands
  • Preferably works on mobile devices, but PC-only libraries may be useful too
  • Offline operation is also preferable, but not required
  • Does not need to be open source; commercial licensing is fine

Does such an API or software exist?

Answer 1:

I have recently been involved in a project developing a platform for mobile, grammar-based speech recognition applications, with the following features:

  • The grammars are written in Grammatical Framework, see: http://kaljurand.github.com/Grammars/
  • The server is based on Sphinx, see: https://github.com/alumae/ruby-pocketsphinx-server
  • The server can be accessed from Android, see: https://code.google.com/p/recognizer-intent/

All the components are open source and it shouldn't be too hard to set up your own server and port the system to your language, given that you have the acoustic models for that language.



Answer 2:

VoiceXML and SRGS might be a good starting point for your search. Sadly, there's not much available as open source, because getting this sort of thing "right" will mean a big payday.



Answer 3:

Using a speech recognition system that supports grammars (SRGS) should increase your recognition rate. A grammar restricts the search space by specifying the expected words and phrases as rules. The recognizer then only has to match the input against those rules, which can improve both performance and recognition rate.
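To make the idea of a grammar concrete, here is a minimal sketch that generates an SRGS XML grammar accepting exactly one phrase from a fixed command list. The helper name `build_srgs`, the rule id `command`, and the sample phrases are assumptions for illustration; how the resulting grammar is loaded depends entirely on the recognizer you use.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_srgs(commands, rule_id="command", lang="en-US"):
    """Build an SRGS XML grammar whose root rule matches one of `commands`."""
    # Serialize SRGS elements without a prefix (default namespace).
    ET.register_namespace("", SRGS_NS)
    grammar = ET.Element(
        f"{{{SRGS_NS}}}grammar",
        {
            "version": "1.0",
            "mode": "voice",
            "root": rule_id,
            f"{{{XML_NS}}}lang": lang,  # serialized as xml:lang
        },
    )
    rule = ET.SubElement(
        grammar, f"{{{SRGS_NS}}}rule", {"id": rule_id, "scope": "public"}
    )
    # <one-of> lists the alternatives; each <item> is one allowed phrase.
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    for phrase in commands:
        item = ET.SubElement(one_of, f"{{{SRGS_NS}}}item")
        item.text = phrase
    return ET.tostring(grammar, encoding="unicode")


# Hypothetical command set.
print(build_srgs(["play", "pause", "stop", "next track"]))
```

With a grammar like this, the recognizer never has to consider words outside the four listed phrases, which is where the accuracy gain comes from.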

VoiceXML is a good language for developing speech applications that use a telephone as the mode of interaction. By that I mean the user dials an IVR system, which answers the call and then interacts with the user through recorded audio prompts, taking input as speech or telephone keypad presses. VoiceXML is not intended for mobile apps that have visual interfaces, such as a native Android application or a web application. To develop visual applications that use speech, you could use something like Nuance's mobile tool, which can carry a hefty price tag, or something open source like Sphinx.



Answer 4:

Most cloud-based APIs for speech recognition (Google, AT&T, Siri, etc.) do not allow a custom SRGS grammar to be used to improve accuracy. That is really unfortunate.

One possibility is to combine two technologies from Voxeo, namely Tropo and Phono. The former is an API-based voice platform that is much easier to use than VoiceXML platforms, and the latter is a jQuery plugin for making (and controlling) voice calls from your browser. Tropo supports SRGS grammars.
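
If you are stuck with a cloud recognizer that only returns free text, a rough fallback is to snap the returned transcript to the nearest entry in your command list. The sketch below does this with Python's standard `difflib`. It is post-processing on text, not a grammar, so it cannot constrain the recognizer itself; the names `COMMANDS` and `snap_to_command` and the 0.6 cutoff are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical command set.
COMMANDS = ["play", "pause", "stop", "next track", "previous track"]


def snap_to_command(transcript, commands=COMMANDS, cutoff=0.6):
    """Return the closest known command, or None if nothing is close enough.

    Assumption: `transcript` is the plain text returned by some recognizer.
    """
    cleaned = transcript.strip().lower()
    matches = get_close_matches(cleaned, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for heard in ["next trek", "Paws", "xyzzy"]:
    print(heard, "->", snap_to_command(heard))
```

Raising the cutoff rejects more near misses at the cost of ignoring more valid commands; the right value depends on how distinct the commands are.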