keras: how to aggregate over frame-level predictions

I am doing song genre classification. I have chopped each song into short frames (5 s) and computed a spectrogram for each frame as the input features for a neural network. Each frame carries the genre label of its song.

The data looks like the following:

   name         label   feature
   ....
   song_i_frame1 label   feature_vector_frame1
   song_i_frame2 label   feature_vector_frame2
   ...
   song_i_framek label   feature_vector_framek
   ...

I can get a per-frame prediction accuracy from Keras with no problem. But I do not know how to aggregate the frame-level predictions to song level with majority voting, because the frame names are lost once the data is fed into the Keras model.

How can I retain the name of each frame (for example, song_i_frame1) alongside the Keras outputs, so that I can form a song-level prediction via majority voting? Or are there other methods to aggregate to a song-level prediction?

I googled around but cannot find an answer to this and would appreciate any insight.

In the dataset each label is probably a name (e.g. 'rock'). To use it with a neural network, it needs to be transformed to an integer (e.g. 2) and then to a one-hot encoding (e.g. [0, 0, 1]), so 'rock' == 2 == [0, 0, 1]. The output predictions have the same layout, with one score per class: [0.1, 0.1, 0.9] means that class 2 was predicted, [0.9, 0.1, 0.1] means class 0, and so on. To do this in a reversible way, use sklearn.preprocessing.LabelBinarizer.

There are several ways of combining frame-predictions into an overall prediction. The most common are, in increasing order of complexity:

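Majority voting: each frame votes for its highest-scoring class, and the class with the most votes wins.
Mean/average voting: the per-class scores are averaged over all frames, and the class with the highest mean wins.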

Below is an example.

import numpy
from sklearn.preprocessing import LabelBinarizer

labels = [ 'rock', 'jazz', 'blues', 'metal' ] 

binarizer = LabelBinarizer()
y = binarizer.fit_transform(labels)

print('labels:\n', '\n '.join(labels))
print('y:\n', y)

# Outputs from frame-based classifier. 
# input would be all the frames in one song
# frame_predictions = model.predict(frames)
frame_predictions = numpy.array([
    [ 0.5, 0.2, 0.3, 0.9 ],
    [ 0.9, 0.2, 0.3, 0.3 ],
    [ 0.5, 0.2, 0.3, 0.7 ],
    [ 0.1, 0.2, 0.3, 0.5 ],
    [ 0.9, 0.2, 0.3, 0.4 ],
])

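def vote_majority(p):
    # Each frame votes for its highest-scoring class. minlength keeps one
    # entry per class even when the last classes receive no votes.
    voted = numpy.bincount(numpy.argmax(p, axis=1), minlength=p.shape[1])
    normalized = voted / p.shape[0]
    return normalized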

def vote_average(p):
    return numpy.mean(p, axis=0)

maj = vote_majority(frame_predictions)
mean = vote_average(frame_predictions)

genre_maj = binarizer.inverse_transform(numpy.array([maj]))
genre_mean = binarizer.inverse_transform(numpy.array([mean]))
print('majority voting:', maj, genre_maj)
print('mean voting:', mean, genre_mean)

Output

labels:
 rock
 jazz
 blues
 metal
y:
 [[0 0 0 1]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]]
majority voting: [0.4 0.  0.  0.6] ['rock']
mean voting: [0.58 0.2  0.3  0.56] ['blues']
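
To keep track of which song each frame belongs to, one option is to hold the song names in a separate array that is aligned row-for-row with the feature matrix; the names never need to pass through the model. Below is a minimal sketch of that idea. The `song_ids` array, the helper `aggregate_by_song`, and the prediction values (standing in for the result of `model.predict`) are all made up for illustration.

```python
import numpy as np

# Hypothetical inputs: one song id per frame, in the same row order as the
# feature matrix passed to model.predict(). The ids never enter the network.
song_ids = np.array(['song_a', 'song_a', 'song_a', 'song_b', 'song_b'])

# Made-up stand-in for model.predict(features): one row of scores per frame.
frame_predictions = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.1, 0.6],
])

def aggregate_by_song(song_ids, frame_predictions):
    """Return {song_id: (majority_class, mean_class)} as class indices."""
    n_classes = frame_predictions.shape[1]
    results = {}
    for song in np.unique(song_ids):
        # Select the rows (frames) that belong to this song.
        p = frame_predictions[song_ids == song]
        votes = np.bincount(np.argmax(p, axis=1), minlength=n_classes)
        majority_class = int(np.argmax(votes))
        mean_class = int(np.argmax(p.mean(axis=0)))
        results[str(song)] = (majority_class, mean_class)
    return results

song_predictions = aggregate_by_song(song_ids, frame_predictions)
print(song_predictions)
```

The class indices can be mapped back to genre names through the `classes_` attribute of the fitted LabelBinarizer. If the frame names follow a pattern like song_i_frame1, the song id could be derived by stripping the frame suffix from each name.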

One can also perform voting by using a classifier trained on the frame-wise predictions, though this is not so commonly done and is complicated when the input length varies.
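
One possible way around the varying input length is to summarize each song's frame predictions into a fixed-length vector before training the second classifier. The sketch below is only an illustration under that assumption: the summary statistics (mean, max, standard deviation), the helper names, the choice of LogisticRegression, and the randomly generated data are not from the original setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes = 3

def song_features(frame_predictions):
    """Summarize a (n_frames, n_classes) array as one fixed-length vector."""
    return np.concatenate([
        frame_predictions.mean(axis=0),
        frame_predictions.max(axis=0),
        frame_predictions.std(axis=0),
    ])

def fake_song(true_class):
    """Made-up frame predictions for one song with a random frame count."""
    n_frames = rng.integers(5, 15)
    p = rng.random((n_frames, n_classes))
    p[:, true_class] += 1.0  # bias the scores toward the true class
    return p / p.sum(axis=1, keepdims=True)

# 30 synthetic songs, 10 per class, each with a different number of frames.
song_labels = np.array([0, 1, 2] * 10)
X = np.stack([song_features(fake_song(c)) for c in song_labels])

# Second-stage classifier: song-level summary vector -> song-level genre.
song_classifier = LogisticRegression(max_iter=1000).fit(X, song_labels)

new_song = fake_song(1)
prediction = song_classifier.predict(song_features(new_song)[np.newaxis, :])
print(prediction)
```

With real data, the second-stage classifier would normally be fitted on frame predictions for songs that the frame-level model did not see during training, so that it does not learn from overconfident scores.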

Another alternative is to use Multiple-Instance Learning with GlobalAveragePooling over the frame-based classifications, so that the model is trained on whole songs at once.
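
A rough sketch of what such a model could look like in Keras is given below. It assumes each frame is a flat feature vector; the sizes `n_features` and `n_classes`, the single hidden layer, and the random input batch are placeholders rather than a recommended architecture.

```python
import numpy as np
from tensorflow import keras

n_features = 16  # hypothetical length of one frame's feature vector
n_classes = 4

# None as the first dimension allows a different number of frames per song.
frames = keras.Input(shape=(None, n_features))
# Dense layers act on the last axis, so the same weights score every frame.
hidden = keras.layers.Dense(32, activation='relu')(frames)
frame_probs = keras.layers.Dense(n_classes, activation='softmax')(hidden)
# Average the per-frame class probabilities into one prediction per song.
song_probs = keras.layers.GlobalAveragePooling1D()(frame_probs)

model = keras.Model(frames, song_probs)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Random stand-in batch: 2 songs with 7 frames each.
x = np.random.rand(2, 7, n_features).astype('float32')
song_level = model.predict(x, verbose=0)
print(song_level.shape)
```

Training then needs only one label per song. Within a single batch all songs must have the same number of frames, so songs of different lengths would need to be padded, cropped, or batched one at a time.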