I am doing song genre classification. I have chopped each song into short frames (5 s) and generated a spectrogram for each frame as the input feature for a neural network. Each frame carries the genre label of its song.
The data looks like the following:
name label feature
....
song_i_frame1 label feature_vector_frame1
song_i_frame2 label feature_vector_frame2
...
song_i_framek label feature_vector_framek
...
I can get a per-frame prediction accuracy from Keras with no problem. But currently I do not know how to aggregate the prediction results from frame level to song level with majority voting, since the frame names are lost when the data is fed into the Keras model.
How can I retain the name of each frame (for example, song_i_frame1) in the Keras outputs, so that I can form an aggregate prediction for the song via majority voting? Or are there other methods to aggregate to a song-level prediction?
I googled around but cannot find an answer to this and would appreciate any insight.
In the dataset each label might be a name (e.g. 'rock'). To use this with a neural network, it needs to be transformed to an integer (e.g. 2), and then to a one-hot encoding (e.g. [0,0,1]). So 'rock' == 2 == [0,0,1]. Your output predictions will have the same layout, with one score per class: [ 0.1, 0.1, 0.9 ] means that class 2 was predicted, [ 0.9, 0.1, 0.1 ] means class 0, etc. To do this in a reversible way, use sklearn.preprocessing.LabelBinarizer.
There are several ways of combining frame predictions into an overall prediction. The most common are, in increasing order of complexity:
Below is an example.
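A minimal sketch of frame-to-song aggregation, not a definitive implementation. It relies on the fact that model.predict returns one row per input row, in the same order, so a list of frame names kept alongside the feature matrix still lines up with the predictions. The function name aggregate_by_song, the "_frame" naming scheme, the genre names and the probabilities are all made up for illustration; class_names stands in for whatever maps output columns back to labels (for example the classes_ attribute of a fitted LabelBinarizer).

```python
import numpy as np


def aggregate_by_song(frame_names, frame_probs, class_names):
    """Combine frame-level class probabilities into one prediction per song.

    frame_names: names such as "song_a_frame1", in the same order as the rows
                 that were passed to model.predict().
    frame_probs: array of shape (n_frames, n_classes), e.g. model.predict(X).
    class_names: class labels in output-column order.
    """
    frame_probs = np.asarray(frame_probs)
    n_classes = frame_probs.shape[1]
    # Assumption: everything before the last "_frame" identifies the song.
    song_ids = np.array([name.rsplit("_frame", 1)[0] for name in frame_names])

    results = {}
    for song in np.unique(song_ids):
        probs = frame_probs[song_ids == song]
        # Majority voting: each frame votes for its most probable class.
        # Ties go to the class with the lowest index.
        votes = np.bincount(probs.argmax(axis=1), minlength=n_classes)
        # Mean voting: average the probabilities, then pick the best class.
        mean_probs = probs.mean(axis=0)
        results[str(song)] = {
            "majority": class_names[votes.argmax()],
            "mean": class_names[mean_probs.argmax()],
        }
    return results


# Illustrative data only: in practice frame_probs would be model.predict(X).
frame_names = [
    "song_a_frame1", "song_a_frame2", "song_a_frame3",
    "song_b_frame1", "song_b_frame2", "song_b_frame3",
]
frame_probs = [
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
    [0.60, 0.30, 0.10],
    [0.55, 0.05, 0.40],
    [0.55, 0.05, 0.40],
    [0.00, 0.05, 0.95],
]
class_names = ["jazz", "pop", "rock"]

results = aggregate_by_song(frame_names, frame_probs, class_names)
for song, prediction in results.items():
    print(song, prediction)
```

The two schemes can disagree: for song_b two of three frames narrowly favour the first class, so majority voting picks it, while one very confident frame pulls the mean towards the last class. If the frames are shuffled before prediction, the name list has to be shuffled with the same permutation.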
Output
One can also perform voting by using a classifier trained on the frame-wise predictions, though this is not so commonly done and is complicated when the input length varies.
Another alternative is to use Multiple-Instance-Learning with GlobalAveragePooling over the frame-based classifications, to learn on whole songs at once.
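A rough sketch of what such a model could look like in Keras, under several assumptions that are not from the post: every song is cut or padded to a fixed number of frames, the per-frame network is a tiny placeholder CNN, and the names build_song_model, frame_shape and frames_per_song are hypothetical. TimeDistributed applies the same frame classifier to every frame, and GlobalAveragePooling1D averages the per-frame class probabilities into one song-level prediction, so training uses one label per song.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_song_model(frame_shape=(128, 216, 1), n_classes=3, frames_per_song=6):
    """Song-level classifier that averages per-frame predictions."""
    # Placeholder frame classifier; swap in the real spectrogram network.
    frame_model = keras.Sequential([
        keras.Input(shape=frame_shape),
        layers.Conv2D(8, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(n_classes, activation="softmax"),
    ])

    # One input sample is a whole song: (frames, height, width, channels).
    songs = keras.Input(shape=(frames_per_song, *frame_shape))
    # Apply the same frame classifier to every frame of the song.
    per_frame = layers.TimeDistributed(frame_model)(songs)
    # Average the frame-level class probabilities over the frame axis.
    song_probs = layers.GlobalAveragePooling1D()(per_frame)

    model = keras.Model(songs, song_probs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model


# Smoke test on random data with small, made-up dimensions.
model = build_song_model(frame_shape=(32, 32, 1), n_classes=3, frames_per_song=4)
dummy_songs = np.random.rand(2, 4, 32, 32, 1).astype("float32")
song_predictions = model.predict(dummy_songs, verbose=0)
print(song_predictions.shape)
```

Because the pooling is a plain average of softmax outputs, each row of the result is still a probability distribution over genres, and the frame names never need to reach the network: grouping frames into songs happens when the input array is built.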