We give this image to our users:
This picture represents separate numbers, and all of our users read it into their microphones as "11-0-9-5".
We use Google Speech Engine, and it returns this result:
"1109 5".
This makes it impossible for us to compare the spoken words with the expected result. And we're stuck in this phase.
Is there a way to tell Google's Speech Recognition to interpret spoken numbers literally and separately, and not join them together?
You can try using a speech context to constrain the Google Speech Engine to stick to predefined numbers: https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig#SpeechContext
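So if you specify 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 as possible phrases, Google should be much less likely to send back 1109, as it is not in the context. Note that the phrases act as hints that bias recognition rather than as a strict whitelist.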
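As a rough illustration, here is a minimal sketch of a request body for the REST `speech:recognize` method that passes the numbers as phrase hints. The helper name `build_recognize_request`, the `LINEAR16` encoding, the 16 kHz sample rate, and the `en-US` language code are assumptions for the example; adjust them to match your audio.

```python
import base64
import json


def build_recognize_request(audio_bytes, phrases,
                            language_code="en-US", sample_rate_hertz=16000):
    """Build a JSON-serializable body for POST .../v1/speech:recognize.

    Hypothetical helper: encoding, sample rate and language are assumed values.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            # SpeechContext: phrases the recognizer should favor.
            "speechContexts": [{"phrases": list(phrases)}],
        },
        # Inline audio must be base64-encoded in the REST API.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


# The numbers 0..11 as separate hint phrases.
digit_phrases = [str(n) for n in range(12)]
request_body = build_recognize_request(b"\x00\x00", digit_phrases)
print(json.dumps(request_body["config"], indent=2))
```

The resulting dictionary can be sent as the JSON body of the recognize call with whatever HTTP client and authentication you already use.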
However, with this method you have to list all possible values, which can be tedious. Some cases won't be solved, for example if someone pronounces 11 as 1-1.
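For the cases that hints do not cover, one possible workaround (not part of the Speech API, just a sketch) is to normalize both the transcript and the expected value to bare digits before comparing. The function names `digits_only` and `matches_expected` are made up for this example, and the approach assumes the engine returns digits rather than spelled-out words.

```python
import re


def digits_only(text):
    """Drop every non-digit character, e.g. spaces and hyphens."""
    return re.sub(r"\D", "", text)


def matches_expected(transcript, expected):
    """Compare two number strings while ignoring how the digits are grouped."""
    return digits_only(transcript) == digits_only(expected)


print(matches_expected("1109 5", "11-0-9-5"))
print(matches_expected("1 1 0 9 5", "11-0-9-5"))
```

This ignores the grouping entirely, so "11-0-9-5" and "1-1-0-9-5" are treated as the same answer. Whether that is acceptable depends on what you need to verify.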