How can I differentiate between two people speaking? For example, if one person says "hello" and then another person says "hello", what kind of signature should I be looking for in the audio data? Periodicity?
Thanks a lot to anyone who can answer this!
The solution to this problem lies in digital signal processing (DSP). Speaker recognition is a complex problem that draws on both computing and communications engineering. Most speaker identification techniques combine signal processing with machine learning (training on a database of speakers, then identifying a speaker using the training data). An outline of an algorithm you could follow:
- Record the audio in raw format. This serves as the digital signal which needs to be processed.
- Apply some pre-processing routines to the captured signal. These could be as simple as signal normalization, or filtering the signal to remove noise (using a band-pass filter covering the normal frequency range of the human voice; a band-pass filter can in turn be built by combining a low-pass and a high-pass filter).
- Once the captured signal is reasonably free of noise, the feature extraction phase begins. Well-known techniques for extracting voice features include Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), or simple FFT features.
- Next come two phases: training and testing.
- First, the system needs to be trained on the voice features of the different speakers before it can distinguish between them. To make sure the features are estimated reliably, it is recommended to collect several (>10) voice samples from the speakers for training.
- Training can be done with different techniques, such as neural networks or distance-based classification, to find the differences between the features of different speakers' voices.
- In the testing phase, the training data is used to find the voice feature set that lies at the smallest distance from the signal being tested. Different distance measures, such as Euclidean or Chebyshev distance, can be used to calculate this proximity.
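As a rough illustration of the outline above, here is a minimal sketch in plain Python. It is not ALIZE's or MARF's method, just one way the steps could fit together: peak normalization, "simple FFT features" (energy in a few frequency bands, computed with a naive DFT, which also acts as a crude band-pass), a per-speaker average as the training step, and nearest-distance matching for testing. The sample rate, frame length, band edges, the synthetic "voices" and every function name are assumptions made up for the example; real speech would need real recordings and better features such as MFCC.

```python
import cmath
import math
import random

SAMPLE_RATE = 8000   # Hz (assumed for this sketch)
N_SAMPLES = 400      # 50 ms frame, i.e. 20 Hz per DFT bin
BANDS = [(80, 160), (160, 240), (240, 320), (320, 400)]  # Hz, assumed band edges

def normalize(signal):
    """Pre-processing: scale so the peak absolute amplitude is 1."""
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]

def band_energies(signal):
    """Feature extraction: share of spectral energy in each band (naive DFT)."""
    n = len(signal)
    energies = []
    for low_hz, high_hz in BANDS:
        k_lo = math.ceil(low_hz * n / SAMPLE_RATE)
        k_hi = math.ceil(high_hz * n / SAMPLE_RATE)  # exclusive upper bin
        energy = 0.0
        for k in range(k_lo, k_hi):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(signal))
            energy += abs(coeff) ** 2
        energies.append(energy)
    total = sum(energies) or 1.0
    return [e / total for e in energies]

def features(signal):
    return band_energies(normalize(signal))

def train(samples_by_speaker):
    """Training: store the mean feature vector of each speaker."""
    model = {}
    for speaker, samples in samples_by_speaker.items():
        vectors = [features(s) for s in samples]
        model[speaker] = [sum(col) / len(col) for col in zip(*vectors)]
    return model

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def identify(model, signal, distance=math.dist):
    """Testing: pick the speaker whose stored features are closest.
    math.dist is the Euclidean distance (Python 3.8+)."""
    vec = features(signal)
    return min(model, key=lambda speaker: distance(model[speaker], vec))

def synthetic_voice(pitch_hz, rng):
    """Stand-in for a recording: a jittered tone, one harmonic, some noise."""
    f = pitch_hz + rng.uniform(-5, 5)
    return [math.sin(2 * math.pi * f * t / SAMPLE_RATE)
            + 0.5 * math.sin(2 * math.pi * 2 * f * t / SAMPLE_RATE)
            + rng.gauss(0, 0.05)
            for t in range(N_SAMPLES)]

rng = random.Random(0)
training = {
    "speaker_a": [synthetic_voice(120, rng) for _ in range(12)],
    "speaker_b": [synthetic_voice(220, rng) for _ in range(12)],
}
model = train(training)
print(identify(model, synthetic_voice(120, rng)))
print(identify(model, synthetic_voice(220, rng), distance=chebyshev))
```

Swapping `distance` lets you compare Euclidean and Chebyshev matching without touching the rest of the pipeline.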
There are two open-source implementations that support speaker identification: ALIZE (http://mistral.univ-avignon.fr/index_en.html) and MARF (http://marf.sourceforge.net/).
I know it's a bit late to answer this question, but I hope someone finds it useful.
This is an extremely hard problem, even for experts in speech and signal processing. This page has much more information: http://en.wikipedia.org/wiki/Speaker_recognition
And some suggested technology starting points:
The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, Vector Quantization and decision trees. Some systems also use "anti-speaker" techniques, such as cohort models, and world models.
With only two people to differentiate, the task becomes much easier if they are uttering the same word or phrase. I suggest starting with something simple, and only adding complexity as needed.
To begin, I'd try sample counts of the digital waveform, binned by time and magnitude, or (if you have the software functionality handy) an FFT of the entire utterance. I'd also start with a basic modeling process, such as a linear discriminant (or whatever you already have available).
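To make the binning idea concrete, here is a small sketch in plain Python. It counts samples per (time segment, magnitude range) cell and feeds those counts to a two-class linear discriminant. To keep it short, the discriminant assumes a diagonal pooled covariance, which is a simplification of Fisher's discriminant rather than the full thing. The bin counts, the toy "utterances" (two speakers that differ only in how quickly their loudness fades) and all names are assumptions for illustration only.

```python
import math
import random

def binned_counts(signal, time_bins=4, mag_bins=3, max_mag=1.0):
    """Fraction of samples falling in each (time segment, magnitude range) cell."""
    counts = [0] * (time_bins * mag_bins)
    n = len(signal)
    for i, x in enumerate(signal):
        t = i * time_bins // n
        m = min(int(abs(x) / max_mag * mag_bins), mag_bins - 1)
        counts[t * mag_bins + m] += 1
    return [c / n for c in counts]

def fit_discriminant(class0, class1, eps=1e-6):
    """Linear discriminant with a diagonal pooled covariance (simplified)."""
    dims = range(len(class0[0]))
    mean0 = [sum(v[d] for v in class0) / len(class0) for d in dims]
    mean1 = [sum(v[d] for v in class1) / len(class1) for d in dims]
    weights = []
    for d in dims:
        ss = (sum((v[d] - mean0[d]) ** 2 for v in class0)
              + sum((v[d] - mean1[d]) ** 2 for v in class1))
        var = ss / (len(class0) + len(class1) - 2) + eps  # eps avoids divide by zero
        weights.append((mean1[d] - mean0[d]) / var)
    # Put the decision boundary halfway between the two class means.
    bias = -sum(w * (a + b) / 2 for w, a, b in zip(weights, mean0, mean1))
    return weights, bias

def classify(model, vec):
    """Return 0 or 1 depending on which side of the boundary vec falls."""
    weights, bias = model
    score = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1 if score > 0 else 0

def toy_utterance(decay, rng, n=800):
    """Stand-in for a recorded word: a tone with a decaying loudness envelope."""
    d = decay + rng.uniform(-0.1, 0.1)
    return [math.exp(-d * i / n) * math.sin(2 * math.pi * 40 * i / n)
            + rng.gauss(0, 0.02)
            for i in range(n)]

rng = random.Random(1)
fading = [binned_counts(toy_utterance(3.0, rng)) for _ in range(12)]  # speaker 0
steady = [binned_counts(toy_utterance(0.3, rng)) for _ in range(12)]  # speaker 1
model = fit_discriminant(fading, steady)
print(classify(model, binned_counts(toy_utterance(3.0, rng))))
print(classify(model, binned_counts(toy_utterance(0.3, rng))))
```

If the binned counts turn out not to separate your two speakers, the same `fit_discriminant` could be fed FFT magnitudes of the whole utterance instead.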
Another way to go is to use an array of microphones and differentiate between the positions and directions of the vocal sources. I consider this an easier approach, since the position calculation is much less complicated than separating different speakers from a mono or stereo source.
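For what it's worth, here is a minimal sketch of the direction idea with just two microphones, in plain Python. It estimates the time difference of arrival by looking for the cross-correlation peak and converts that to a bearing under a far-field assumption. The microphone spacing, sample rate, the white-noise stand-in for a voice and all function names are assumptions for the example; a real array would also have to deal with echoes and overlapping speech.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C
SAMPLE_RATE = 16000     # Hz (assumed)
MIC_SPACING = 0.2       # metres between the two microphones (assumed)
# Largest physically possible lag, in samples, for this spacing.
MAX_LAG = math.ceil(MIC_SPACING / SPEED_OF_SOUND * SAMPLE_RATE)

def delayed(signal, lag):
    """Copy of signal shifted by lag samples (positive = later), zero padded."""
    n = len(signal)
    return [signal[i - lag] if 0 <= i - lag < n else 0.0 for i in range(n)]

def estimate_lag(left, right, max_lag=MAX_LAG):
    """Lag (in samples) by which `right` trails `left`, from the
    cross-correlation peak. Negative means `right` hears the sound first."""
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        start, stop = max(0, -lag), min(n, n - lag)
        score = sum(left[i] * right[i + lag] for i in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def bearing_degrees(lag):
    """Angle from straight ahead, positive toward the left microphone
    (far-field approximation: sin(angle) = c * delay / spacing)."""
    ratio = lag / SAMPLE_RATE * SPEED_OF_SOUND / MIC_SPACING
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))

# Simulate two talkers at different positions with broadband noise bursts.
rng = random.Random(2)
source = [rng.gauss(0, 1) for _ in range(1000)]

lag_first = estimate_lag(source, delayed(source, 5))    # nearer the left mic
lag_second = estimate_lag(delayed(source, 7), source)   # nearer the right mic
print(lag_first, bearing_degrees(lag_first))
print(lag_second, bearing_degrees(lag_second))
```

Two speakers who stay in different places then show up as two distinct clusters of bearings, which can be told apart with a simple threshold.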