I am trying to achieve the following:
- Using Skype, call my mailbox (works)
- Enter password and tell the mailbox that I want to record a new welcome message (works)
- Now, my mailbox tells me to record the new welcome message after the beep
- I want to wait for the beep and then play the new message (doesn't work)
How I tried to achieve the last point:
- Create a spectrogram using FFT and sliding windows (works)
- Create a "finger print" for the beep
- Search for that fingerprint in the audio that comes from skype
The problem I am facing is the following:
The result of the FFTs on the audio from skype and the reference beep are not the same in a digital sense, i.e. they are similar, but not the same, although the beep was extracted from an audio file with a recording of the skype audio. The following picture shows the spectrogram of the beep from the Skype audio on the left side and the spectrogram of the reference beep on the right side. As you can see, they are very similar, but not the same...
uploaded a picture http://img27.imageshack.us/img27/6717/spectrogram.png
I don't know, how to continue from here. Should I average it, i.e. divide it into column and rows and compare the averages of those cells as described here? I am not sure this is the best way, because he already states, that it doesn't work very good with short audio samples, and the beep is less than a second in length...
Any hints on how to proceed?
You should determine the peak frequency and duration (possibly a minumum power over that duration for the frequency (RMS being the simplest measure)
This should be easy enough to measure. To make things even more clever (but probably completely unnecessary for this simple matching task), you could assert the non-existance of other peaks during the window of the beep.
Update
To compare a complete audio fragment, you'll want to use a Convolution algorithm. I suggest using a ready made library implementation instead of rolling your own.
Wikipedia lists http://freeverb3.sourceforge.net as an open source candidate
Edit Added link to API tutorial page: http://freeverb3.sourceforge.net/tutorial_lib.shtml
Additional resources:
http://en.wikipedia.org/wiki/Finite_impulse_response
http://dspguru.com/dsp/faqs/fir
Existing packages with relevant tools on debian:
First I'd smooth it a bit in frequency-direction so that small variations in frequency become less relevant. Then simply take each frequency and subtract the two amplitudes. Square the differences and add them up. Perhaps normalize the signals first so differences in total amplitude don't matter. And then compare the difference to a threshold.