I am doing song genre classification (2 classes). I have chopped each song into small frames (5 s each) and computed MFCCs as input features for a neural network; each frame carries the genre label of its song.
The data looks like the following:
name label feature
song_i_frame1 label feature_vector_frame1
song_i_frame2 label feature_vector_frame2
song_i_framek label feature_vector_framek
I know that I can randomly pick, say, 80% of the songs (with all their frames) as training data and use the rest for testing. But right now each row of X_train is a single frame, and the binary cross-entropy loss is defined at the frame level. I am wondering how I can customize the loss function so that it is minimized over an aggregation of the frame-level predictions (e.g. a majority vote over the frame predictions of each song).
Currently, what I have is:
model_19mfcc = Model(input_shape = (X_train19.shape[1], X_train19.shape[2]))
model_19mfcc.compile(loss='binary_crossentropy', optimizer="RMSProp", metrics=["accuracy"])
history_fit = model_19mfcc.fit(X_train19, y_train, validation_split=0.25, batch_size=1800 // 50, epochs=200)
Also, when I feed the training and testing data into Keras, the corresponding ID (name) of each row is lost. Is keeping the data (name, label, and feature) in a separate pandas DataFrame and matching the Keras predictions back to it a good practice, or are there better alternatives?
Thanks in advance!
A customized loss function is usually not needed for genre classification. A combined model for a song that is split into multiple prediction windows can be set up with Multiple Instance Learning (MIL).
MIL is a supervised learning approach where the label is not attached to each independent sample (instance), but instead to a "bag" (an unordered set) of instances. In your case the instance is each 5-second window of MFCC features, and the bag is the entire song.
In Keras we use the TimeDistributed layer to execute our model for all windows. Then we combine the results using GlobalAveragePooling1D, effectively implementing mean-voting across the windows. This is more easily differentiable than majority voting. Below is a runnable example:
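A minimal sketch of this setup, assuming the TensorFlow Keras API and random data in place of real MFCC features. The small CNN used as the per-window model and the names build_inner_model and build_song_model are illustrative assumptions, not a recommended architecture:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_inner_model(bands=13, frames=216, channels=1):
    """Per-window model: one MFCC window in, one genre probability out."""
    return keras.Sequential([
        keras.Input(shape=(bands, frames, channels)),
        layers.Conv2D(8, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),
    ], name="window_model")


def build_song_model(inner, windows=23, bands=13, frames=216, channels=1):
    """Song-level model: apply `inner` to every window, then average."""
    inputs = keras.Input(shape=(windows, bands, frames, channels))
    # Run the same inner model on each window: (batch, windows, 1)
    per_window = layers.TimeDistributed(inner)(inputs)
    # Mean-vote across the windows of a song: (batch, 1)
    song_prob = layers.GlobalAveragePooling1D()(per_window)
    return keras.Model(inputs, song_prob, name="song_model")


# Random stand-in data: 8 songs, 23 windows each, one label per song.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 23, 13, 216, 1)).astype("float32")
y = rng.integers(0, 2, size=(8, 1)).astype("float32")

inner = build_inner_model()
model = build_song_model(inner)
# The ordinary loss now applies to the averaged, song-level prediction.
model.compile(loss="binary_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])

inner.summary()
model.summary()
print(X.shape)

model.fit(X, y, epochs=1, batch_size=4, verbose=0)
preds = model.predict(X, verbose=0)  # one probability per song
```

Because the loss is computed on the pooled output, each training example is a whole song, so the train/test split also happens at the song level.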
The example prints the inner and combined model summaries.
And the shape of the feature vector fed to the model is (8, 23, 13, 216, 1): 8 songs, 23 windows each, with 13 MFCC bands and 216 frames in each window, plus a fifth dimension of size 1 (the channel axis) to make Keras happy...
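To get from frame-level rows (name, label, feature) to a tensor of that shape, the windows can be grouped by song. This is a sketch under the assumption that every song has the same number of windows; frames_to_bags is a hypothetical helper, not part of Keras:

```python
import numpy as np


def frames_to_bags(song_ids, labels, features):
    """Group window-level rows into per-song bags.

    Assumes every song has the same number of windows (np.stack
    raises otherwise) and that all windows of a song share a label.
    """
    song_ids = np.asarray(song_ids)
    labels = np.asarray(labels)
    features = np.asarray(features)
    # Unique song names in first-seen order.
    names = list(dict.fromkeys(song_ids.tolist()))
    bags, bag_labels = [], []
    for name in names:
        mask = song_ids == name
        song_labels = labels[mask]
        if not np.all(song_labels == song_labels[0]):
            raise ValueError(f"inconsistent labels for {name}")
        bags.append(features[mask])
        bag_labels.append(song_labels[0])
    # Add the trailing channel axis expected by 2D convolutions.
    X = np.stack(bags)[..., np.newaxis]
    y = np.asarray(bag_labels, dtype="float32").reshape(-1, 1)
    return names, X, y


# Stand-in data: 8 songs x 23 windows, each window 13 bands x 216 frames.
song_ids = np.repeat([f"song_{i}" for i in range(8)], 23)
labels = np.repeat(np.arange(8) % 2, 23)
features = np.zeros((8 * 23, 13, 216), dtype="float32")

names, X_bags, y_bags = frames_to_bags(song_ids, labels, features)
print(X_bags.shape, y_bags.shape)
```

The returned names list is in the same order as the rows of X_bags, so row i of the model's predictions belongs to names[i].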