I am doing song genre classification (2 classes). I have chopped each song into small frames (5 s) and computed MFCCs for each frame as input features for a neural network. Each frame carries the genre label of the song it came from.
The data looks like the following:
name label feature
....
song_i_frame1 label feature_vector_frame1
song_i_frame2 label feature_vector_frame2
...
song_i_framek label feature_vector_framek
...
I know that I can randomly pick, say, 80% of the songs (with all their frames) as training data and use the rest for testing. But right now X_train is built at the frame level, so the binary cross-entropy loss is also defined at the frame level. I am wondering how I can customize the loss function so that it is minimized over an aggregation of the frame-level predictions for each song (e.g. a majority vote over the predictions for that song's frames).
Currently, what I have is:
model_19mfcc = Model(input_shape=(X_train19.shape[1], X_train19.shape[2]))
model_19mfcc.compile(loss='binary_crossentropy', optimizer="RMSProp", metrics=["accuracy"])
# batch_size must be an integer, so use floor division (1800 / 50 is the float 36.0 in Python 3)
history_fit = model_19mfcc.fit(X_train19, y_train, validation_split=0.25, batch_size=1800 // 50, epochs=200)
Also, when I feed the training and testing data into Keras, the corresponding ID (name) of each row is lost. Is it good practice to keep the data (name, label, and feature) in a separate pandas DataFrame and match the Keras predictions back to it? Or are there better alternatives?
Thanks in advance!
A customized loss function is usually not needed for genre classification. A combined model for a song that is split into multiple prediction windows can be set up with Multiple Instance Learning (MIL).
MIL is a supervised learning approach where the label is not attached to each independent sample (instance), but instead to a "bag" (an unordered set) of instances. In your case each instance is a 5 second window of MFCC features, and the bag is the entire song.
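In array terms, going from frame-level rows to song-level bags could look like the sketch below. The helper name `frames_to_bags` and the toy data are hypothetical, and the sketch assumes every song has the same number of frames. It also touches on the ID question: the returned list of song names stays aligned with the first axis of the bag array, so predictions can be matched back by position.

```python
import numpy as np

def frames_to_bags(names, labels, features):
    """Group frame-level rows into one bag per song (hypothetical helper).

    names:    song name for each frame, length n_frames
    labels:   label for each frame, length n_frames
    features: array of shape (n_frames, ...)
    Assumes every song contributes the same number of frames.
    """
    names = np.asarray(names)
    labels = np.asarray(labels)
    features = np.asarray(features)
    # Unique song names in first-seen order
    song_ids = list(dict.fromkeys(names.tolist()))
    bags, bag_labels = [], []
    for sid in song_ids:
        mask = names == sid
        bags.append(features[mask])          # all frames of this song
        bag_labels.append(labels[mask][0])   # frames share the song's label
    return song_ids, np.stack(bags), np.asarray(bag_labels)

# Toy data: 3 songs, 2 frames each, 3 features per frame
names = ["a", "a", "b", "b", "c", "c"]
labels = [0, 0, 1, 1, 0, 0]
features = np.arange(18).reshape(6, 3)

song_ids, X_bags, y_bags = frames_to_bags(names, labels, features)
print(song_ids, X_bags.shape, y_bags)
```

Row `i` of a song-level prediction array then belongs to `song_ids[i]`, which avoids a separate lookup table.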
In Keras we use the TimeDistributed layer to execute our model for all windows. Then we combine the results using GlobalAveragePooling1D, effectively implementing mean-voting across the windows. This is more easily differentiable than majority voting. Below is a runnable example:
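A minimal sketch of this setup, assuming `tf.keras` is available. The inner convolutional model, its layer sizes, and the random stand-in data are illustrative assumptions; only the TimeDistributed plus GlobalAveragePooling1D structure and the input shape are taken from the description here.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_inner_model(bands=13, frames=216):
    """Hypothetical per-window model: one MFCC window -> genre probability."""
    return keras.Sequential([
        layers.Input(shape=(bands, frames, 1)),
        layers.Conv2D(8, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),
    ], name="window_model")

def build_song_model(inner, windows=23, bands=13, frames=216):
    """Apply the inner model to every window, then average the outputs."""
    inputs = layers.Input(shape=(windows, bands, frames, 1))
    per_window = layers.TimeDistributed(inner)(inputs)        # (batch, windows, 1)
    song_prob = layers.GlobalAveragePooling1D()(per_window)   # (batch, 1)
    model = keras.Model(inputs, song_prob, name="song_model")
    # Standard loss, now applied to the song-level (mean) prediction
    model.compile(loss="binary_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model

inner = build_inner_model()
song_model = build_song_model(inner)
inner.summary()
song_model.summary()

# Random stand-in data: 8 songs, one label per song (not per window)
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 23, 13, 216, 1)).astype("float32")
y = rng.integers(0, 2, size=(8, 1)).astype("float32")

song_model.fit(X, y, batch_size=4, epochs=1, verbose=0)
pred = song_model.predict(X, verbose=0)
print(X.shape, pred.shape)
```

Because the labels are given per song and the averaging happens inside the model, the ordinary `binary_crossentropy` loss is already a song-level loss here.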
The example outputs the inner and combined model summaries:
And the shape of the feature vector fed to the model is (8, 23, 13, 216, 1): 8 songs, 23 windows each, with 13 MFCC bands and 216 frames in each window, plus a fifth dimension of size 1 to make Keras happy...
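To make the difference between mean-voting and majority voting concrete, here is a small NumPy sketch with made-up window probabilities. The function names are hypothetical. Majority voting thresholds each window first, which gives a step function with zero gradient almost everywhere, while the mean of the probabilities is smooth and can be plugged straight into binary cross-entropy.

```python
import numpy as np

def song_level_bce(window_probs, song_labels, eps=1e-7):
    """Binary cross-entropy on the mean window probability of each song.

    window_probs: shape (n_songs, n_windows), values in [0, 1]
    song_labels:  shape (n_songs,), values 0 or 1
    """
    p = np.clip(window_probs.mean(axis=1), eps, 1 - eps)  # mean-voting
    y = np.asarray(song_labels, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def majority_vote(window_probs, threshold=0.5):
    """Hard vote: threshold each window, then take the majority per song."""
    votes = window_probs > threshold
    return (votes.mean(axis=1) > 0.5).astype(int)

# Made-up probabilities: 2 songs, 3 windows each
probs = np.array([[0.9, 0.8, 0.4],
                  [0.2, 0.6, 0.1]])
labels = np.array([1, 0])

loss = song_level_bce(probs, labels)
votes = majority_vote(probs)
print(loss, votes)
```

Majority voting is still fine at prediction time if hard decisions are wanted. It is only awkward as a training objective.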