I am creating a pitch detection program that extracts the fundamental frequency from the power spectrum obtained from the FFT of a frame. This is what I have so far:
- divide the input audio signal into frames
- multiply each frame by a Hamming window
- compute the FFT of the frame and take the magnitude: sqrt(real^2 + imag^2)
- find the fundamental frequency (peak) via the harmonic product spectrum (HPS)
- convert the frequency of the peak (bin frequency) to a note (e.g. ~440 Hz is A4)
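The steps above can be sketched roughly as follows, assuming NumPy; the function and parameter names are illustrative, and the note formula assumes the common 0-87 piano-key numbering where A4 (440 Hz) has index 48:

```python
import numpy as np

def detect_note(frame, sr, n_harmonics=5):
    """Estimate the piano-note index (0-87) of one audio frame via HPS."""
    windowed = frame * np.hamming(len(frame))        # Hamming window
    spectrum = np.abs(np.fft.rfft(windowed))         # sqrt(real^2 + imag^2)

    # Harmonic product spectrum: multiplying downsampled copies of the
    # spectrum reinforces the fundamental, the common divisor of the harmonics.
    hps = spectrum.copy()
    for h in range(2, n_harmonics + 1):
        decimated = spectrum[::h]
        hps[:len(decimated)] *= decimated

    peak_bin = int(np.argmax(hps))
    freq = peak_bin * sr / len(frame)                # bin index -> Hz

    # Equal-temperament mapping relative to A4 = 440 Hz = index 48.
    return int(round(12 * np.log2(freq / 440.0))) + 48
```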
Now the program produces an integer from 0 to 87 for each frame; each integer corresponds to a piano note according to a formula I found here. I am now trying to imitate the melodies in the input signal by synthesizing sounds based on the calculated notes. I tried simply generating a sine wave with the magnitude and frequency of the detected fundamental, but the result sounded nothing like the original (almost like random beeps).
I don't really understand music, so based on what I have: can I generate a sound with melodies similar to the input (instrument, voice, instrument + voice) from the information the fundamental frequency gives me? If not, what other ideas can I try using the code I currently have?
Thanks!
Your method might work for synthetic music whose notes are synchronized to fit your FFT frame timing and length, and which uses only note sounds whose complete spectrum is compatible with your HPS pitch estimator. Neither of those is true for common music.
For the more general case, automatic music transcription still seems to be a research problem, with no simple five-step solution. Pitch is a human psychoacoustic phenomenon: people will hear notes that may or may not be present in the local spectrum. The HPS pitch estimation algorithm is much more reliable than picking the FFT peak, but it can still fail for many kinds of musical sounds. Also, the FFT of any frame that crosses a note boundary or a transient may contain no clear single pitch to estimate.
Your approach will not work for any general musical example, for the following reasons:
1. Music is by its very nature dynamic: every sound present in music is shaped by distinct periods of silence, attack, sustain, decay, and silence again, otherwise known as the envelope of the sound.
2. Musical instrument notes and human vocal notes cannot be properly synthesized by a single tone; they must be synthesized by a fundamental tone and many harmonics.
3. However, it is not sufficient to synthesize the fundamental tone and the harmonics of an instrument or vocal note; one must also synthesize the envelope of the note, as described in 1 above.
4. Furthermore, to synthesize a melodic passage, whether instrumental or vocal, one must synthesize items 1-3 above for every note in the passage, and one must also synthesize the timing of every note relative to the beginning of the passage.
5. Analytically extracting individual instruments or human voices from a final-mix recording is a very difficult problem, and your approach doesn't address it, so it cannot properly address issues 1-4.
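To make the harmonics-plus-envelope point concrete, here is a rough sketch of additive synthesis of a single note; all harmonic amplitudes and envelope timings below are invented for demonstration:

```python
import numpy as np

def synth_note(freq, duration, sr=44100, harmonics=(1.0, 0.5, 0.3, 0.2)):
    """One note = fundamental + harmonics, shaped by an ADSR-style envelope."""
    t = np.arange(int(duration * sr)) / sr
    # Sum the fundamental and a few weaker harmonics.
    tone = sum(a * np.sin(2 * np.pi * freq * (h + 1) * t)
               for h, a in enumerate(harmonics))
    # Piecewise-linear envelope: attack / decay / sustain / release.
    n = len(t)
    attack, decay, release = int(0.02 * sr), int(0.05 * sr), int(0.10 * sr)
    sustain = n - attack - decay - release      # assumes duration is long enough
    env = np.concatenate([
        np.linspace(0.0, 1.0, attack),          # silence -> peak
        np.linspace(1.0, 0.7, decay),           # fall to sustain level
        np.full(sustain, 0.7),                  # hold
        np.linspace(0.7, 0.0, release),         # back to silence
    ])
    return tone * env
```

Playing the result back-to-back with a bare sine of the same frequency makes the difference in realism obvious, even though this is still far from a real instrument.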
In short, any approach that attempts to extract a near-perfect musical transcription from the final mix of a musical recording by strict analytical methods is at worst almost certainly doomed to failure, and at best falls within the realm of advanced research.
How to proceed from this impasse depends on the purpose of the work, something the OP didn't mention.
Will this work be used in a commercial product, or is it a hobby project?
If it is a commercial work, various further approaches are warranted (costly or very costly ones), but the details of those approaches depend on the goals of the work.
As a closing note, your synthesis sounds like random beeps for the following reasons:
1. Your fundamental-tone detector is tied to the timing of your rolling FFT frames, which in effect generates a probably fake fundamental tone at the start time of each and every rolling FFT frame.
2. Why are the detected fundamental tones probably fake? Because you are arbitrarily clipping the musical sample into (FFT) frames, and are therefore probably truncating many concurrently sounding notes somewhere mid-note, thereby distorting the spectral signatures of those notes.
3. You are not trying to synthesize the envelopes of the detected notes, nor can you, because there is no way to obtain envelope information from your analysis.
Therefore, the synthesized result is probably a series of pure sine chirps, spaced in time by the rolling FFT frame's delta-t. Each chirp may have a different frequency and a different magnitude, and its envelope is probably rectangular in shape.
To see the complex nature of musical notes, take a look at these references:
- Musical instrument spectra to 102.4 kHz
- Musical instrument note spectra and their time-domain envelopes
Observe in particular the many pure tones that make up each note, and the complex shape of the time-domain envelope of each note. The variable timing of multiple notes relative to each other is an additional essential aspect of music, as is polyphony (multiple voices sounding concurrently) in typical music.
All of these elements of music conspire to make the strict analytical approach to autonomous musical transcription extremely challenging.
It depends greatly on the musical content you want to work with - extracting the pitch of a monophonic recording (i.e. single instrument or voice) is not the same as extracting the pitch of a single instrument from a polyphonic mixture (e.g. extracting the pitch of the melody from a polyphonic recording).
For monophonic pitch extraction there are various algorithms you could try to implement, in both the time domain and the frequency domain. A couple of examples are YIN (time domain) and HPS (frequency domain); links with further details on both are provided on Wikipedia:
However, neither will work well if you want to extract the melody from polyphonic material. Melody extraction from polyphonic music is still a research problem, and there isn't a simple set of steps you can follow. There are some tools out there provided by the research community that you can try out (for non-commercial use only though), namely:
As a final note, when synthesizing your output I'd recommend synthesizing the continuous pitch curve that you extract. The easiest way to do this is to estimate the pitch every X ms (e.g. 10) and synthesize a sine wave that changes frequency every 10 ms while keeping the phase continuous. This will make your result sound a lot more natural, and you avoid the extra error involved in quantizing a continuous pitch curve into discrete notes (which is a problem in its own right).
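A minimal sketch of that suggestion, assuming `pitches` holds one Hz estimate per hop (names and hop size are illustrative):

```python
import numpy as np

def synth_pitch_curve(pitches, sr=44100, hop_ms=10):
    """Sine synthesis of a pitch track, one frequency value per hop."""
    hop = int(sr * hop_ms / 1000)
    t = np.arange(hop)
    phase = 0.0
    out = np.empty(len(pitches) * hop)
    for i, f in enumerate(pitches):
        # Start each hop at the phase where the previous one ended,
        # so a frequency change produces no click at the boundary.
        out[i * hop:(i + 1) * hop] = np.sin(phase + 2 * np.pi * f * t / sr)
        phase = (phase + 2 * np.pi * f * hop / sr) % (2 * np.pi)
    return out
```

Carrying the phase across hops is the whole trick: restarting each hop at phase zero is exactly what produces the beepy, clicky result described in the question.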
You probably don't want to be picking peaks from an FFT to calculate the pitch. You probably want to use autocorrelation instead. I wrote up a long answer to a very similar question here: Cepstral Analysis for pitch detection
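For illustration, a minimal autocorrelation pitch estimator might look like this; it is only a sketch, and real implementations need extra care with octave errors and unvoiced frames:

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=50.0, fmax=1000.0):
    """Estimate pitch as the lag of the strongest autocorrelation peak."""
    frame = frame - frame.mean()                     # remove any DC offset
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)          # plausible pitch-period lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

Restricting the search to a plausible lag range (here 50-1000 Hz) is what keeps the zero-lag peak and very long lags from winning.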