Google Cloud Text-to-speech word timestamps

Posted 2020-06-08 14:49

爷、活的狠高调

Question:

I'm generating speech through Google Cloud's text-to-speech API and I'd like to highlight words as they are spoken.

Is there a way of getting timestamps for spoken words or sentences?

Answer 1:

This question seems to have gotten quite popular so I thought I'd share what I ended up doing. This method will probably only work with English or similar languages.

I first split the text on any punctuation that causes a break in speaking. Each "sentence" is converted to speech separately. The resulting audio files have a seemingly random amount of silence at the end, which needs to be removed before joining them; the FFmpeg silencedetect filter can be used to find where that trailing silence starts. You can then join the audio files with an appropriate gap between them. Approximate word timestamps can be linearly interpolated within each sentence.
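
A rough Python sketch of the splitting and interpolation steps, to make the idea concrete. It is not the exact code I used: the function names, the set of break punctuation, the 0.25 s gap, and the choice to weight each word by its character count are all assumptions for illustration. The durations are expected to be the lengths (in seconds) of the already trimmed audio clips, measured however you like.

```python
import re

# Punctuation assumed to cause a spoken pause (an assumption; tune per language and voice).
BREAK_PATTERN = re.compile(r"(?<=[.!?;:,])\s+")


def split_into_chunks(text):
    """Split text after pause-causing punctuation, keeping the punctuation."""
    return [chunk for chunk in BREAK_PATTERN.split(text.strip()) if chunk]


def interpolate_word_times(chunk, start, duration):
    """Spread the words of one chunk linearly over [start, start + duration].

    Each word gets a share of the duration proportional to its character
    count (an assumed weighting; equal time per word would also be "linear").
    """
    words = chunk.split()
    total_chars = sum(len(word) for word in words)
    times = []
    elapsed_chars = 0
    for word in words:
        word_start = start + duration * elapsed_chars / total_chars
        elapsed_chars += len(word)
        word_end = start + duration * elapsed_chars / total_chars
        times.append((word, word_start, word_end))
    return times


def build_timeline(chunks, durations, gap=0.25):
    """Build (word, start, end) tuples for chunks joined with a fixed gap.

    durations[i] is the length in seconds of the trimmed audio for chunks[i];
    gap is the silence inserted between clips when joining (hypothetical value).
    """
    timeline = []
    cursor = 0.0
    for chunk, duration in zip(chunks, durations):
        timeline.extend(interpolate_word_times(chunk, cursor, duration))
        cursor += duration + gap
    return timeline
```

For example, `build_timeline(split_into_chunks(text), durations)` gives a list you can walk through during playback to decide which word to highlight. The estimates drift most on chunks with numbers or abbreviations, since their spoken length does not match their written length.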