I am working on a simple internet radio app for iOS with very simple speech/music discrimination. The main idea is a radio player that streams a signal from a URL and at the same time checks what type of signal is being broadcast. When it detects speech, it changes the channel, and so on.
I wrote a simple iOS app using storyboards and AVFoundation for the player. My problem is implementing the speech detection. I wrote the algorithm in MATLAB, but I'm not sure how to do it in Xcode.
clear all
close all

[s, fs] = audioread('nagranie.wav');

% Short frames: 20 ms = 0.02 s
lengthofframe20ms = round(0.02*fs);
numberofframes20ms = round(length(s)/lengthofframe20ms);

% Zero-pad the signal to a whole number of short frames
s1 = zeros(lengthofframe20ms*numberofframes20ms, 1);
for i = 1:length(s(:,1))
    s1(i,1) = s(i,1);
end

frame20ms = zeros(numberofframes20ms, lengthofframe20ms);
for i = 1:numberofframes20ms
    for j = 1:lengthofframe20ms
        frame20ms(i,j) = s1(j + lengthofframe20ms*(i-1), 1);
    end
end

% Long frames: 260 ms = 0.26 s
lengthofframe260ms = round(0.26*fs);
numberofframes260ms = round(length(s)/lengthofframe260ms);

% Zero-pad the signal to a whole number of long frames
s2 = zeros(lengthofframe260ms*numberofframes260ms, 1);
for i = 1:length(s(:,1))
    s2(i,1) = s(i,1);
end

frame260ms = zeros(numberofframes260ms, lengthofframe260ms);
for i = 1:numberofframes260ms
    for j = 1:lengthofframe260ms
        frame260ms(i,j) = s2(j + lengthofframe260ms*(i-1), 1);
    end
end

% Mean energy of every short frame
En = zeros(numberofframes20ms, 1);
for i = 1:numberofframes20ms
    L = length(frame20ms(i,:));
    En(i) = (norm(frame20ms(i,:))^2)/L;
end

% Mean energy of every long frame
Ek = zeros(numberofframes260ms, 1);
for i = 1:numberofframes260ms
    L = length(frame260ms(i,:));
    Ek(i) = (norm(frame260ms(i,:))^2)/L;
end

sumN = sum(En);
sumK = sum(Ek);

% Short-frame energies normalised by the total long-frame energy
EnP = En/sumK;

% Decision: flag low-energy short frames inside each long frame
threshold = 0.5;
lambda = threshold*sumN;
M = numberofframes20ms/numberofframes260ms;
coff = zeros(numberofframes20ms, 1);
for i = 1:numberofframes20ms
    if En(i) < lambda
        for k = 1:numberofframes260ms
            if ((k-1)*M + 1) < i && i < k*M
                coff(i) = 1;   % frame i classified as speech
            end
        end
    end
end
As you can see, first we divide the signal into short 20 ms frames and longer 260 ms frames, then we calculate the energy of every 20 ms frame, do some more math, and finally we check the conditions: when a frame fits, it is classified as speech, and when it doesn't, it is classified as music.
I don't know how to start on the discrimination part. Which frameworks should I use? I think it can't be that hard, because it took me about 20 minutes to write in MATLAB. :)
This is how my app play radio stations:
{
    RadioInfo *sharedRadio = [RadioInfo sharedRadio];
    NSString *program = [NSString stringWithFormat:@"%@", sharedRadio.list[value]];
    NSURL *url = [NSURL URLWithString:program];

    // Create the item and the player once, then start playback
    self.playerItem = [AVPlayerItem playerItemWithURL:url];
    self.player = [AVPlayer playerWithPlayerItem:self.playerItem];
    [self.player play];
}
This is my first post here, so please be kind. I will appreciate any help. I'm stuck on this part.