Speaker Recognition using MARF

Posted 2019-03-31 11:27

Question:

I am using MARF (Modular Audio Recognition Framework) to recognize a speaker's voice. I trained MARF with the voice of person 'A' and tested it with the voice of person 'B'. I trained using --train training-samples and tested using --ident testing-samples/G.wav. In my speakers.txt file I have listed the voice samples of both persons, i.e. A and B.

But I am not getting the correct response: the trained voice and the testing voice are from different people, yet MARF reports that the audio samples match.

I have gone through this link too:

http://stackoverflow.com/questions/4837511/speaker-recognition

Result

    Config: [SL: WAVE, PR: NORMALIZATION (100), FE: FFT (301), CL: EUCLIDEAN_DISTANCE (503), ID: -1]
         Speaker's ID: 26
   Speaker identified: G

Am I doing something wrong, or is there any other speaker recognition method available?

EDIT: Now I am using vText, which is easy to use. http://basic-signalprocessing.com/voiceRecognition.php Follow this link; vText also uses MATLAB to produce the output.

I am getting the correct frequency-time domain graph, but I am not able to compare the two voice samples. I am getting this error:

    Exception: com.mathworks.toolbox.javabuilder.MWException: Error using ==> eq
    Matrix dimensions must agree.

    Error in ==> recognizePartial10k at 10

Does anybody have any idea regarding this?

Answer 1:

The first thing I'd say is that, in my experience, using the FFT algorithm won't give you the best results: try LPC in MARF.

Second: MARF assumes what speech people call a "closed set", which means it will always return a result even if the speaker is not known to the system. You'd have to decide the likelihood of the response based on a distance threshold.
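As a minimal sketch of such an open-set rejection step (the class and method names and the cutoff value here are illustrative assumptions, not part of the MARF API; the threshold must be calibrated on your own data):

```java
public class OpenSetDecision {

    // Hypothetical cutoff: identifications whose best distance exceeds
    // this value are rejected. Calibrate it on your own data set.
    static final double DISTANCE_THRESHOLD = 0.5;

    // Turn a closed-set answer (best speaker + distance) into an
    // open-set one by rejecting matches that are too far away.
    static String decide(String bestSpeaker, double bestDistance) {
        if (bestDistance > DISTANCE_THRESHOLD) {
            return "unknown speaker"; // open-set rejection
        }
        return bestSpeaker;           // accept the closed-set answer
    }

    public static void main(String[] args) {
        System.out.println(decide("G", 0.12)); // close match -> "G"
        System.out.println(decide("G", 0.87)); // too far -> "unknown speaker"
    }
}
```

This is how the mismatch in the question would surface: person B's sample still gets a "best" speaker, but its distance should be large enough to cross the threshold.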

Also make sure the sliding window (Hamming window) size is set according to your file's sample rate: e.g. using a window of 512 sampled values at a sample rate of 22050 Hz yields a window of ca. 23 ms, which in my experience returned the best results on a data set of 500 speakers.

Since 22050 Hz means that many samples per second, finding the desired length of around 25 ms for any sample rate is easy: sample rate / 1000 * 25.

Please note that the FFT algorithm used in MARF requires a window whose size is exactly a power of 2 (256, 512, 1024, ...).

But that's not required for the LPC algorithm (though powers of 2 may still be slightly more efficient for the processor, since that's all it knows :-)).
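The two rules above can be combined in a small helper that computes the ~25 ms window for a given sample rate and then rounds it to the nearest power of 2 for the FFT (a sketch; the method names are mine):

```java
public class WindowSize {

    // Samples for roughly 25 ms at the given sample rate,
    // using the formula above: sample rate / 1000 * 25.
    static int samplesFor25Ms(int sampleRate) {
        return sampleRate / 1000 * 25;
    }

    // Round to the nearest power of 2, as the FFT implementation requires.
    static int nearestPowerOfTwo(int n) {
        int p = 1;
        while (p < n) p <<= 1;                  // smallest power of 2 >= n
        return (p - n < n - p / 2) ? p : p / 2; // pick the closer of p and p/2
    }

    public static void main(String[] args) {
        int raw = samplesFor25Ms(22050);            // 550 samples
        System.out.println(nearestPowerOfTwo(raw)); // 512 -> the ~23 ms window above
    }
}
```

For 22050 Hz this lands on the 512-sample window mentioned earlier; for 8000 Hz it gives 200 samples, rounded to 256 for the FFT.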

Ha, and don't forget that if you're using a stereo file, the window is twice as long... but I would advise using a mono file: there's no added value in a multichannel file for voice processing; it's longer and less precise.
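For completeness, downmixing an interleaved stereo buffer to mono is just an average of each left/right pair (a generic sketch, not MARF-specific):

```java
public class StereoToMono {

    // Average each left/right pair of an interleaved
    // stereo signal laid out as [L, R, L, R, ...].
    static short[] toMono(short[] interleaved) {
        short[] mono = new short[interleaved.length / 2];
        for (int i = 0; i < mono.length; i++) {
            mono[i] = (short) ((interleaved[2 * i] + interleaved[2 * i + 1]) / 2);
        }
        return mono;
    }

    public static void main(String[] args) {
        short[] stereo = {100, 200, -50, 50};
        // Two stereo frames collapse to two mono samples: [150, 0]
        System.out.println(java.util.Arrays.toString(toMono(stereo)));
    }
}
```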

A word on sample rate: the selected sample rate should be twice the highest frequency you're interested in. Usually, people consider that the highest frequency for voice is 4000 Hz and thus select a sample rate of 8000 Hz. Please note that this is not entirely correct: "s" and "sh" sounds reach higher frequencies. It's true that you don't need those frequencies to understand what the speaker is saying, but when extracting a vocal print, it can be useful to use a broader spectrum. My preference goes to 22050 Hz. Some vocal password packages don't allow you to go below 11000 Hz.

A word on bit depth (8 bits vs. 16 bits): while the sample rate is the precision regarding time, the bit depth relates to the precision of the amplitude. 8 bits gives you 256 values; 16 bits gives you 65536 values.

Needless to say why you should use 16 bits for vocal biometry :-)

For reference, an audio CD uses 44100 Hz / 16 bit.

About vText: as I told you earlier, the Fourier transform (FFT) is not something I've found to be usable on large data sets. It lacks precision.

Here it looks like something goes wrong when delegating calculations to MATLAB. Without the code, IMHO, it's nearly impossible to give you more info.
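That said, "Matrix dimensions must agree" from eq typically means the two signals being compared elementwise have different lengths. Without seeing recognizePartial10k, a guess is that truncating both vectors to the shorter length before comparing would avoid the error. A sketch of that idea in Java (the helper is hypothetical, not part of vText):

```java
public class AlignLengths {

    // Truncate both signals to the shorter length so an elementwise
    // comparison (what MATLAB's ==, i.e. "eq", performs) sees
    // matching dimensions.
    static double[][] align(double[] a, double[] b) {
        int n = Math.min(a.length, b.length);
        return new double[][] {
            java.util.Arrays.copyOf(a, n),
            java.util.Arrays.copyOf(b, n)
        };
    }

    public static void main(String[] args) {
        double[][] pair = align(new double[]{1, 2, 3}, new double[]{4, 5});
        // Both arrays now have length 2
        System.out.println(pair[0].length + " " + pair[1].length);
    }
}
```

In MATLAB itself the equivalent one-liner would be trimming both vectors with n = min(length(a), length(b)) before the comparison.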

Don't hesitate to ask for clarification on the things I said; I might take some things for granted and not realize they're not that clear :-)

FWIW, I just wrote a speaker recognition tool in Java called Recognito. I believe it's not way better than MARF in terms of recognition capabilities, but it's definitely easier on the user for the initial steps, uses a licensing model that doesn't require your software to be open source, and supports calls from multiple concurrent threads.

In case you want to give Recognito a shot : https://github.com/amaurycrickx/recognito