How to compare two audio data?

Posted 2019-03-30 17:57

Question:

I will record my own voice and save the recordings as WAV files on my computer. Later on I will speak, and the computer should match my voice command against the pre-recorded WAV files.

Question: How can I check whether two audio recordings are equal, or whether there is, say, an 80% match between them?

if (audio1 == audio2)
   do task A
else if (audio1 is a bit similar to audio2)
   do task B
else if (audio1 is an 80% match for audio2)
   do task C
end if

What is the best way to compare two audio data?

Answer 1:

Unfortunately, you won't get very far by comparing audio waveforms directly. There is a huge amount of research on speech and speaker recognition, and you'll just be reinventing the wheel if you don't familiarise yourself with the basics. I think you have several choices here, depending on what you really want to do:

  • Start by reading about hidden Markov models (HMMs), dynamic time warping (DTW, as mentioned by learnvst), and Mel-frequency cepstral coefficients (MFCCs).
  • Use an existing speech API, such as the Microsoft one, which takes care of the low-level signal processing and which you can build into your application.
  • Use something even higher level, such as the Windows Speech Recognition Macros, which let you control aspects of your PC via speech (e.g. 'Play Purple Haze').

It depends on whether you want to learn about the low levels of speech processing (which will involve a significant amount of mathematics), or whether you just want something that works with little coding.



Answer 2:

You can find some ideas in Homemade Speech Recognition. It was written for the .NET Compact Framework, but can easily be adapted to plain vanilla .NET. The solution is based on the Fast Fourier Transform.



Answer 3:

By similar, do you mean purely numerically? In that case, a cross-correlation analysis might suffice. Otherwise, if you mean similar in terms of a human listener's auditory perception of the sound sample, then you need to read up on acoustic fingerprinting.
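To make the purely numerical option concrete, here is a minimal sketch of a normalized cross-correlation in plain Python. The function name `peak_cross_correlation` and the sample values are made up for illustration, and the samples are assumed to be already decoded into lists of numbers at the same sample rate. It slides one signal over the other and reports the best score and the lag at which it occurs.

```python
import math


def peak_cross_correlation(a, b):
    """Return (best_score, best_lag) for two sample sequences.

    best_score lies in [-1, 1]. best_lag is the offset at which b[0]
    lines up with a[best_lag] for the highest score.
    """
    # Remove the mean (DC offset) so a constant bias does not dominate.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    a = [x - mean_a for x in a]
    b = [y - mean_b for y in b]

    # Normalize by the signal energies so the score is scale independent.
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0.0:
        return 0.0, 0  # at least one signal is constant (e.g. silence)

    best_score, best_lag = -1.0, 0
    for lag in range(-(len(b) - 1), len(a)):
        total = 0.0
        for j, y in enumerate(b):
            i = j + lag
            if 0 <= i < len(a):
                total += a[i] * y
        score = total / norm
        if score > best_score:
            best_score, best_lag = score, lag
    return best_score, best_lag


# Hypothetical data: a short pulse and the same pulse embedded in a longer clip.
clip = [0, 0, 1, 2, 1, 0, 0, 0]
pulse = [1, 2, 1]
print(peak_cross_correlation(clip, pulse))
```

This direct form costs O(n·m) operations, so for real recordings an FFT-based correlation from a numerical library would be the usual choice. Note also that a high score only says the waveforms line up sample by sample; two utterances of the same word rarely do.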

EDIT:

I'm guessing from your update that you want to do a simple form of speech recognition, correct? If so, your best option for finding the closest match for a signal within a very limited corpus is a Dynamic Time Warping (DTW) based recogniser. Hidden Markov Model based recognition systems are the state of the art, but a DTW-based system will be far simpler to implement.
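As a rough illustration of how small a DTW recogniser can be, the sketch below implements the classic dynamic-programming recurrence in plain Python. The names `dtw_distance` and `recognise`, the template dictionary, and the use of one number per frame are all assumptions made for brevity; a real system would compare feature vectors (for example MFCCs) per frame rather than single numbers.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = cheapest alignment of seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_b[j-1] is matched again (stretch b)
                cost[i][j - 1],      # seq_a[i-1] is matched again (stretch a)
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def recognise(sample, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(sample, templates[name]))


# Hypothetical per-frame features for two pre-recorded commands.
templates = {"up": [1, 2, 3, 4], "down": [4, 3, 2, 1]}
spoken = [1, 1, 2, 3, 3, 4]  # "up", spoken more slowly
print(recognise(spoken, templates))
```

Because the warping path may repeat frames, the slower utterance still aligns with its template at zero cost. To get the "80% match" behaviour from the question, you would compare the winning distance against a threshold tuned on your own recordings.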



Answer 4:

As others have suggested, unless you can give a lot more info, there is no simple solution. If they are just very short sounds that don't change much over time, one possibility is to take an FFT of each and compare the resulting spectra.
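A minimal sketch of that spectral comparison follows. To stay dependency-free it uses a naive discrete Fourier transform, which produces the same values as an FFT but in O(n²) time; with real recordings you would swap in an FFT routine from a numerical library. The function names, the equal-length requirement, and the choice of cosine similarity as the comparison are assumptions of this example.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of the non-negative frequency bins (naive DFT)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(samples))
        spectrum.append(abs(acc))
    return spectrum


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def spectral_similarity(a, b):
    """Compare two equal-length signals by their magnitude spectra."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes signals of equal length")
    return cosine_similarity(magnitude_spectrum(a), magnitude_spectrum(b))


# Synthetic tones: same pitch with a phase offset, and a different pitch.
n = 64
tone = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
tone_shifted = [math.sin(2 * math.pi * 4 * t / n + 1.0) for t in range(n)]
other_tone = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
print(spectral_similarity(tone, tone_shifted))
print(spectral_similarity(tone, other_tone))
```

Discarding the phase makes the comparison insensitive to when the sound starts within the window, which is why this only suits short, steady sounds.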

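For something more complex, you could take a similar approach, but use a short-time Fourier transform (STFT), which applies the transform to successive short frames so that changes over time are preserved.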
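The sketch below shows one way to compute STFT magnitudes in plain Python: cut the signal into overlapping frames, apply a Hann window, and transform each frame. The function name `stft_magnitudes`, the frame length of 32, and the hop of 16 are illustrative assumptions, and a naive DFT stands in for an FFT to avoid dependencies.

```python
import cmath
import math


def stft_magnitudes(samples, frame_len=32, hop=16):
    """Return one magnitude spectrum per overlapping, Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / frame_len)
                      for t, x in enumerate(frame))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Synthetic signal whose pitch changes halfway through.
signal = ([math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
          + [math.sin(2 * math.pi * 8 * t / 32) for t in range(64)])
frames = stft_magnitudes(signal)
# Strongest frequency bin in each frame, showing the change over time.
print([max(range(len(f)), key=f.__getitem__) for f in frames])
```

The resulting list of per-frame spectra is a sequence of feature vectors, so two recordings can then be compared frame by frame or with an alignment method that tolerates differences in speaking speed.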

In all likelihood, however, there is a domain-specific answer to your question.