Cepstral Analysis for pitch detection

2019-01-16 02:54发布

站内文章 / 前沿技术

60 0

祖国的老花朵

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm looking to extract pitches from a sound signal.

Someone on IRC just explained to me how taking a double FFT achieves this. Specifically:

take FFT
take log of square of absolute value (can be done with lookup table)
take another FFT
take absolute value

I am attempting this using vDSP

I can't understand how I didn't come across this technique earlier. I did a lot of hunting and asking questions; several weeks worth. More to the point, I can't understand why I didn't think of it.

I am attempting to achieve this with vDSP library. It looks as though it has functions to handle all of these tasks.

However, I'm wondering about the accuracy of the final result.

I have previously used a technique which scours the frequency bins of a single FFT for local maxima. When it encounters one, it uses a cunning technique (the change in phase since the last FFT) to more accurately place the actual peak within the bin.

I am worried that this precision will be lost with this technique I'm presenting here.

I guess the technique could be used after the second FFT to get the fundamental accurately. But it kind of looks like the information is lost in step 2.

As this is a potentially tricky process, could someone with some experience just look over what I'm doing and check it for sanity?

Also, I've heard there is an alternative technique involving fitting a quadratic over neighbouring bins. Is this of comparable accuracy? If so, I would favour it, as it doesn't involve remembering bin phases.

So, questions:

does this approach makes sense? Can it be improved?
I'm a bit worried about the "log square" component; there seems to be a vDSP function to do exactly that: vDSP_vdbcon. However, there is no indication it precalculates a log-table -- I assume it doesn't, as the FFT function requires an explicit pre-calculation function to be called and passed into it. And this function doesn't.
Is there some danger of harmonics being picked up?
is there any cunning way of making vDSP pull out the maxima, biggest first?
Can anyone point me towards some research or literature on this technique?
the main question: Is it accurate enough? Can the accuracy be improved? I have just been told by an expert that the accuracy IS INDEED not sufficient. Is this the end of the line?

PS I get SO annoyed when I want to create tags, but cannot. :| I have suggested to the maintainers that SO keep track of attempted tags, but I'm sure I was ignored. We need tags for vDSP, accelerate framework, cepstral analysis

回答1:

Okay, let's go through one by one:

I'm looking to extract pitches from a sound signal.

Although I am not an expert and have had minimal formal training, I think I know the best answer to this problem. I've done a lot of searching, reading, and experimenting over the past few years. My consensus is that the autocorrelation method is by far the best pitch detector in terms of the tradeoff between accuracy, complexity, noise robustness, and speed. Unless you have some very specific circumstances, I would almost always recommend using autocorrelation. More on this later, let me answer your other questions.

What you describe is "cepstral analysis" which is a method mainly used for the extraction of pitch from speech. Cepstral analysis relies entirely on the plentifulness and strength of the overtones of your signal. If for example, you were to pass a pure sine wave through cepstral analysis, you would get terrible results. However, for speech, which is a complex signal, there is a large number of overtones. (overtones, by the way, are elements of the signal which are oscillating at multiples of the fundamental frequency i.e. the pitch we perceive). Cepstral analysis can be robust in detecting speech with a missing fundamental frequency. That is, suppose you plotted the function sin(4x)+sin(6x)+sin(8x)+sin(10x). If you look at that, it is clear that it has the same frequency as the function sin(2x). However, if you apply fourier analysis to this function, the bin corresponding to sin(2x) will have zero magnitude. Thus this signal is consider to have a "missing fundamental frequency", because it does not contain the sinusoid of the frequency which we consider it to be. Thus simply picking the biggest peak on the fourier transform will not work on this signal.

I have previously used a technique which scours the frequency bins of a single FFT for local maxima. when it encounters one, it uses a cunning technique (the change in phase since the last FFT) to more accurately place the actual peak within the bin.

What you are describing is the phase vocoder technique to more accurately measure the frequency of a given partial. However, the basic technique of picking out the biggest bin is going to cause you problems if you use a signal with a missing or weak fundamental frequency component.

I am worried that this precision will be lost with this technique I'm presenting here.

First of all, remember that the phase vocoder technique only more accurately measures the frequency of a single partial. It ignores the information contained in the higher partials about the fundamental frequency. Second of all, given a decent FFT size, you can get very good accuracy using peak interpolation. Someone else here has pointed you towards parabolic interpolation. I also would suggest this.

If you parabolically interpolate the FFT of a 4098 sample block of data at 44100 Hz, with a pitch about 440 hz, that will mean it will be between the 40th (430.66 Hz) and 41st (441.430664064) bin. Assuming this paper is approximately correct in the general case, it says parabolic interpolation increases resolution by more than one order of magnitude. This leaves the resolution at at least 1 Hz, which is the threshold of human hearing. In fact, if you use an ideal Gaussian window, parabolic interpolation is exact at the peaks (That's right, exact. remember, however, that you can never use a true Gaussian window, because it extends forever in both directions.) If you are still worried about getting higher accuracy, you can always pad the FFT. This means adding zeros to the end of the FFT before transforming. It works out that this is equivalent to "sinc interpolation" which is the ideal interpolation function for frequency limited signals.

I guess the technique could be used after the second FFT to get the fundamental accurately. But it kind of looks like the information is lost in step 2.

That is correct. The phase vocoder technique relies on the fact that sequential frames are connected and have a specific phase relationship. However, the log magnitude of the FFT of sequential frames does not show the same relationship in terms of phase, thus it would be useless to use this transform for the second FFT.

does this approach makes sense? Can it be improved?

Yes and yes, I will elaborate on the improvement in my bit on autocorrelation at the end.

I'm a bit worried about And the log square component; there seems to be a vDSP function to do exactly that: vDSP_vdbcon however, there is no indication it precalculates a log-table -- I assume it doesn't, as the FFT function requires an explicit pre-calculation function to be called and passed into it. and this function doesn't.

I don't know the specifics of the vDSP library, sorry.

Is there some danger of harmonics being picked up?

In your original phase-vocoder peak picking technique? yes. With the cepstral method? no, not really, the whole point is that it considers all the harmonics to get its frequency estimate. For exmaple, let's say our freqency is 1. Our overtones are 2,3,4,5,6,7,8,9,etc We would have to take out all of the odd harmonics, i.e. leave 2,4,6,8, etc, and remove the fundamental frequency before it would start to be confused with one of its overtones.

is there any cunning way of making vDSP pull out the maxima, biggest first?

Don't know vDSP, but in the general case, you usually just iterate over all of them and keep track of the biggest.

Can anyone point me towards some research or literature on this technique?

The link P. i gave you in a comment seemed like a good one.

Also, this website offers an incredibly in-depth and wonderfully broad explanation of DSP topics, including all sorts of pitch extraction, manipulation, etc, in both a theoretical and practical way. (this is a more general link to an index on the site). I always find myself coming back to it. Sometimes it can be a bit overwhelming if you jump into the middle of it, but you can always follow every explanation back to the basic building blocks.

Now for autocorrelation. Basically the technique is this: You take your (windowed) signal and time delay it different amounts. Find the amount which matches up best with your original signal. That is the fundamental period. It makes a lot of theoretical sense. You are hunting for the repetitive parts of your signal.

In practice, taking the correlation with all these time delayed copies of the signal is slow. It is usually implemented in this way instead (which is mathematically equivalent):

Zero-Pad it to double its original length.Take the FFT. Then replace all the coefficients with their square magnitude, except for the first, which you set to 0. Now take the IFFT. Divide every element by the first one. This gives you the autocorrelation. Mathematically, you are using the circular convolution theorem (look it up), and using zero-padding to convert a linear convolution problem into a circular convolution one, which can be efficiently solved.

However, be careful about picking the peak. For very small delays, the signal will match up with itself very well, simply because it is continuous. (I mean, if you delay it zero, it correlates perfectly with itself) Instead, pick the largest peak after the first zero-crossing. You can parabolically interpolate the autocorrelation function as well just as with other techniques to get much more accurate values.

This by itself will give you very good pitch detection by all criteria However, you might sometimes encounter a problem with pitch halving and pitch doubling. Basically the problem is that if a signal is repetitive every 1 second, it is also repetitive every two seconds. Similarly, if it has a very strong overtone, you might get pitch halving. So the biggest peak might not always be the one you want. A solution to this problem is the MPM algorithm by Phillip McLeod. The idea is this:

Instead of picking the biggest peak, you want to pick the first peak that is large enough to be considered. How do you determine if a peak is large enough to be considered? If it is at least as high as A*the largest peak, where A is some constant. Phillip suggests a value of A around 0.9 I think. Actually the program he wrote, Tartini, allows you to compare several different pitch detection algorithms in real time. I would strongly suggest downloading it and trying it out (it implements Cepstrum, straight autocorrelation, and MPM): (if you have trouble building, try the instructions here.

One last thing I should note is about windowing. In general, any smooth window will do. Hanning window, Hamming window, etc. Hopefully you should know how to window. I would also suggest doing overlapped windows if you want more accurate temporal measurements.

By the way, a cool property of the autocorrelation is that if the frequency is changing linearly through the windowed section you are measuring, it will give you the correct frequency at the center of the window.

One more thing: What I described is called the biased autocorrelation function. This is because for higher time lags, the overlap between the original signal and the time lagged version becomes less and less. For example, if you look at a window of size N which has been delayed N-1 samples, you see that only one sample overlaps. So the correlation at this delay is clearly going to be very close to zero. You can compensate for this, by diving each value of the autocorrelation function by the number of samples overlap to get it. This is called the unbiased autocorrelation. However, in general, you will get worse results with this, as the higher delay values of the autocorrelation are very noisy, as they are based on only a few samples, so it makes sense to weigh them less.

If you're looking for more information, as always, google is your friend. Good search terms: autocorrelation, pitch detection, pitch tracking, pitch extraction, pitch estimation, cepstrum, etc.

回答2:

This is a brief analysis of the Cepstrum used for pitch determination.

First let's examine a synthetic signal.

The plot below shows the Cepstrum of a synthetic steady-state E2 note, synthesized using a typical near-DC component, a fundamental at 82.4 Hz, and 8 harmonics at integer multiples of 82.4 Hz. The synthetic sinusoid was programmed to generate 4096 samples.

Observe the prominent non-DC peak at 12.36. The Cepstrum width is 1024 (the output of the second FFT), therefore the peak corresponds to 1024/12.36 = 82.8 Hz which is very close to 82.4 Hz the true fundamental frequency.

Now let's examine a real acoustical signal.

The plot below shows the Cepstrum of a real acoustic guitar's E2 note. The signal was not windowed prior to the first FFT. Observe the prominent non-DC peak at 542.9. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/542.9 = 60.4 Hz which is fairly far from 82.4 Hz the true fundamental frequency.

The plot below shows the Cepstrum of the same real acoustic guitar's E2 note, but this time the signal was Hann windowed prior to the first FFT. Observe the prominent non-DC peak at 268.46. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/268.46 = 122.1 Hz which is even farther from 82.4 Hz the true fundamental frequency.

The acoustic guitar's E2 note used for this analysis was sampled at 44.1 KHz with a high quality microphone under studio conditions, it contains essentially zero background noise, no other instruments or voices, and no post processing.

This illustrates the significant challenge of using Cepstral analysis for pitch determination in real acoustical signals.

References:

Real audio signal data, synthetic signal generation, plots, FFT, and Cepstral analysis were done here: Musical instrument cepstrum

回答3:

What's wrong with your existing technique that you're interested in a new one? I don't think a cepstrum is going to give you more accurate pitch, if that's the goal. It will, however, help you with suppressed fundamentals. I suppose you could use the cepstrum to get you close, then go back to the first FFT (which I would keep in its original form) and then apply your cunning technique to the bin that the cepstrum guides you to.

As for the quadratic fit, it's referred to in this paper by Ted Knowlton, which came up in another SO question recently, but I've never used it.

I should add that the quadratic fit technique, at least as outlined in the reference from Knowlton, depends on using a rectangular window on the first FFT. As Paul R explained in another of your questions, if you're doing audio processing you should use a Hann or Hamming window on the first FFT. So I guess an overall algorithm could look like:

Take time domain buffer x, make a windowed copy w.
Sx = FFT(x), Sw = FFT(w)
c = Log of square magnitude of Sw
Cx = FFT(c)
Estimate fundamental (and maybe harmonics) using Cx
Use Sw to do cunning phase trick on fundamental (or higher harmonic) bin(s)
And/or use Sx to do quadratic bin fit around fundamental (or higher harmonic)

The (or higher harmonic) note applies if you do indeed have suppressed fundamentals.

And I mentioned this in your other question, but what makes you think the log requires a lookup table? Why not just call the log function? I imagine that the time taken by two FFTs (O(n*logn)) dwarfs any other processing you can do.

回答4:

Cepstrum analysis is a form of homomorphic processing, explained in the book "Discrete-Time Signal Processing" by Oppenheim & Schafer. It was once thought useful for separating out the exciter frequency from a forment envelope (maybe still is, dunno). It seems to work better when given a fairly long window of stationary data.

But Cepstral analysis is not meant for accuracy of frequency estimation. It's actually a lossy form of analysis. But it might be useful at finding the fundamental frequency from a train of harmonics where the fundamental frequency spectral component might be comparatively weak or even missing.

Phase vocoder analysis (not so cunning, as the technique has been around for maybe a half century) is better at frequency estimation for a given peak, assuming you pick the correct peak (not necessarily the strongest one), the peak spectrum is stationary across both fft frames, and the fundamental isn't completely missing from the spectrum.

Quadratic or parabolic interpolation might be a good fit if the transform of your window function resembles a parabola. Sinc interpolation works better with rectangular windows.

回答5:

This answer is meant to be read in addition to Jeremy Salwen's post, and also to answer the question regarding literatures.

First of all it's important to consider what is the signal's periodicity. Whether or not the signal is closer to a fully periodic signal for a given analysis window.

Refer here for a detailed explanation for the term and maths https://en.wikipedia.org/wiki/Almost_periodic_function#Quasiperiodic_signals_in_audio_and_music_synthesis

The short answer is that if for a given analysis window a signal is fully periodic, or if the signal is quasi-periodic and the analysis window is small enough that periodicity is achieved then Autocorrelation is enough for the task. Examples of signals that fulfill these conditions are:

Pure sinusoidal tone
String instruments with long sustains and stable pitch (no vibrato), especially true on the sustain part, not so true on the transients.
Windpipe instruments that are blown long enough.

Example of signals that fail to fulfill these conditions are:

Percussive sounds
String or windpipe instruments that are played with each note only held very short, or changing in a short time
Complex music, or basically combination of multiple instruments that are played with different pitches.

For pitch detection using autocorrelation there is a tutorial on how it is implemented in Praat:

http://www.pinguinorodriguez.cl/blog/pitch-in-praat/ Pitch in Praat A brief explanation of Praat's pitch detection algorithm. This describes the algorithm named 'ac'.
www.fon.hum.uva.nl/paul/praat.html Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Paul Boersma. IFA Proceedings 17: 97-110.

The paper describes in detail about the use of unbiased autocorrelation (the term as used by Jeremy Salwen) for pitch detection, it also shows that it is superior to biased autocorrelation for pitch detection. Although it notes that the autocorrelation results are only significant up to half of the window size, you don't neet to calculate the latter half.

A biased autocorrelation is done by windowing the signals using a tapering window and then doing the autocorrelation. This reduces the effects of low-frequency modulation (amplitude change at a slow time scale) that is detrimental to the pitch detection, since otherwise parts with larger amplitude will give a larger autocorrelation coefficient that will be preferred.

The algorithm used in Boersma's paper can be described in 5 steps:

Remove DC from the signal that is going to be windowed (x - x_avg)
Window the signal using a taper function (He argues that Hann window, or better, Gaussian window is used for it)
Autocorrelates the signal
Divide the autocorrelation function with the autocorrelation of the window used.
Peak-picking (similar to previous algorithms)

It's important to note that the window will go toward zero on both ends, and the autocorrelation of the window will also go towards zero. This is why the latter half of an unbiased autocorrelation is useless, it is a division by zero nearing the end of the window.

Next is YIN: - De Cheveigné, Alain, and Hideki Kawahara. "YIN, a fundamental frequency estimator for speech and music." The Journal of the Acoustical Society of America 111.4 (2002): 1917-1930.

As I understand it the YIN paper also gives evidence that using a taper window has detrimental effects on pitch detection accuracy. And interestingly it prefer to not use any tapering window function (it says something to the effect that tapering window does not bring any improvements to the results and instead complicates it.)

Last is Philip McLeod's SNAC and WSNAC (already linked by Jeremy Salwen):

Philip McLeod, Fast, Accurate Pitch Detection Tools for Music Analysis, PhD thesis, Department of Computer Science, University of Otago, 2008.
McLeod. P, Wyvill. G, "A Smarter Way to Find Pitch", Proc. International Computer Music Conference, Barcelona, Spain, September 5-9, 2005, pp 138-141.
McLeod. P, Wyvill. G, "Visualization of Musical Pitch", Proc. Computer Graphics International, Tokyo, Japan, July 9-11, 2003, pp 300-303.

They can be found on miracle.otago.ac.nz/tartini/papers.html

I haven't read too far into it, but there is a mention of it as a method to reduce the detriment effects of tapering window of biased autocorrelation that is different compared to the method used by Boersma. (note that I haven't come across anything about MPM so I can't say anything about it)

One last suggestion is that if you're making an instrument tuner, the method that would be easier and will have a bit better result compared to autocorrelation is by using cross-correlation with a pure sinusoidal signal with a predetermined frequency.

Jeremy Salwen:

That is, suppose you plotted the function sin(4x)+sin(6x)+sin(8x)+sin(10x). If you look at that, it is clear that it has the same frequency as the function sin(2x). However, if you apply fourier analysis to this function, the bin corresponding to sin(2x) will have zero magnitude. Thus this signal is consider to have a "missing fundamental frequency", because it does not contain the sinusoid of the frequency which we consider it to be.

I would like to argue that although the given signal is periodic at the \omega=2, it is not the same as having the same frequency as the function sin(2x). As fourier analysis will show that the component sin(2x) has zero magnitude. This is related to the point that there is a relation between pitch, frequency and the fundamental frequency of a signal, but they are different and not interchangeable. It is important to remember that pitch is a subjective measurements, that it depends on human as one that perceives it. It looks as though it has the same frequency as sin(2x), that's how we perceive it visually. The same effect also happens similarly on pitch and audio perception. the example that came to mind immediately is Beats, that is the perceived pitch that is heard when there are two sinusoidals with close but different frequencies.