I have a dataframe describing how 100 persons performed on a certain motion test. The frame contains about 25,000 rows per person, since each person's performance is recorded roughly every centisecond (10^-2 s). We want to use this data to predict a binary y-label, that is, whether someone has a motor problem or not.
The columns and some example values of the dataset are as follows:

'Person_ID', 'time_in_game', 'python_time', 'permutation_game', 'round', 'level', 'times_level_played_before', 'speed', 'costheta', 'y_label', 'gender', 'age_precise', 'ax_f', 'ay_f', 'az_f', 'acc', 'jerk'
1, 0.25, 1.497942e+09, 2, 1, 'level_B', 1, 0.8, 0.4655, 1, [...]
I reduced the dataset to 480 rows per person by keeping only the row at each half second.
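For reference, the downsampling can be done along these lines (a sketch; it assumes 'time_in_game' is in seconds and advances in roughly 0.01 s steps):

import pandas as pd

# keep only the rows whose timestamp lands on a half-second boundary,
# leaving ~480 rows per person
df = df[(df['time_in_game'] * 100).round().astype(int) % 50 == 0]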
Now I want to use a recurrent neural network to predict the binary y_label.
The following code extracts the costheta feature as the input data X and the y-label as the output Y.
X = []
Y = []
for ID in person_list:
    person_frame = df.loc[df['Person_ID'] == ID]
    # costheta is a measurement of performance
    coslist = list(person_frame['costheta'])
    # extract the y-label (the same for every row of a person)
    score = person_frame['y_label'].iloc[0]
    X.append(coslist)
    Y.append(score)
I split the data into training and test sets using a 0.2 test split.
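For completeness, the split can be done along these lines (a sketch: it assumes scikit-learn's train_test_split, converts the lists to NumPy arrays since model.fit expects arrays, and the stratify argument is my addition to keep the class ratio equal in both sets):

import numpy as np
from sklearn.model_selection import train_test_split

# convert the per-person lists to arrays: X has shape (100, 480), Y has shape (100,)
X = np.array(X)
Y = np.array(Y)

# 80/20 train/test split, stratified on the binary label
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y)

Then I tried to create the RNN with Keras as follows: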
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_size = 32

model = Sequential()
# different_input_values is the number of distinct input values
# (the Embedding vocabulary size; Embedding expects integer token indices)
model.add(Embedding(different_input_values, embedding_size, input_length=480))
model.add(LSTM(1000))
# output is binary
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
Finally, I began training with this code:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

batch_size = 64
num_epochs = 100

# hold out the first batch of the training data as a validation set
X_valid, y_valid = X_train[:batch_size], Y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], Y_train[batch_size:]

model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid),
          batch_size=batch_size, epochs=num_epochs)
However, the resulting accuracy is very low; depending on the batch size, it varies between 0.4 and 0.6.
12/12 [==============================] - 13s 1s/step - loss: 0.6921 - acc: 0.7500 - val_loss: 0.7069 - val_acc: 0.4219
My question is: in general, how does one efficiently train an RNN on complicated data like this? Should one refrain from reducing the data to 480 rows per person and keep it around 25,000 rows per person? Could adding further features, such as acc (in-game acceleration) and jerk, yield a significant accuracy gain (see the sketch below)? What significant improvements could one make or consider?
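For context, by further features I mean stacking several measurement columns into one multivariate sequence per person, roughly like this (a sketch; an LSTM fed this way would take input of shape (samples, timesteps, features) instead of going through an Embedding layer):

import numpy as np

feature_cols = ['costheta', 'acc', 'jerk']
X, Y = [], []
for ID in person_list:
    person_frame = df.loc[df['Person_ID'] == ID]
    # one multivariate time series per person, shape (480, 3)
    X.append(person_frame[feature_cols].values)
    Y.append(person_frame['y_label'].iloc[0])

X = np.array(X)  # shape (n_persons, 480, 3)
Y = np.array(Y)

# the first model layer would then be e.g.
# model.add(LSTM(1000, input_shape=(480, len(feature_cols))))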