Why normalizing labels in MxNet makes accuracy clo

I am training a model using multi-label logistic regression on MxNet (gluon api) as described here: multi-label logit in gluon My custom dataset has 13 features and one label of shape [,6]. My features are normalized from original values to [0,1] I use simple dense neural net with 2 hidden layers.

I noticed when I don't normalize labels (which take discrete values of 1,2,3,4,5,6 and are purely my choice to map categorical values to these numbers), my training process slowly converges to some minima for example:

Epoch: 0, ela: 8.8 sec, Loss: 1.118188, Train_acc 0.5589, Test_acc 0.5716
Epoch: 1, ela: 9.6 sec, Loss: 0.916276, Train_acc 0.6107, Test_acc 0.6273
Epoch: 2, ela: 10.3 sec, Loss: 0.849386, Train_acc 0.6249, Test_acc 0.6421
Epoch: 3, ela: 9.2 sec, Loss: 0.828530, Train_acc 0.6353, Test_acc 0.6304
Epoch: 4, ela: 9.3 sec, Loss: 0.824667, Train_acc 0.6350, Test_acc 0.6456
Epoch: 5, ela: 9.3 sec, Loss: 0.817131, Train_acc 0.6375, Test_acc 0.6455
Epoch: 6, ela: 10.6 sec, Loss: 0.815046, Train_acc 0.6386, Test_acc 0.6333
Epoch: 7, ela: 9.4 sec, Loss: 0.811139, Train_acc 0.6377, Test_acc 0.6289
Epoch: 8, ela: 9.2 sec, Loss: 0.808038, Train_acc 0.6381, Test_acc 0.6484
Epoch: 9, ela: 9.2 sec, Loss: 0.806301, Train_acc 0.6405, Test_acc 0.6485
Epoch: 10, ela: 9.4 sec, Loss: 0.804517, Train_acc 0.6433, Test_acc 0.6354
Epoch: 11, ela: 9.1 sec, Loss: 0.803954, Train_acc 0.6389, Test_acc 0.6280
Epoch: 12, ela: 9.3 sec, Loss: 0.803837, Train_acc 0.6426, Test_acc 0.6495
Epoch: 13, ela: 9.1 sec, Loss: 0.801444, Train_acc 0.6424, Test_acc 0.6328
Epoch: 14, ela: 9.4 sec, Loss: 0.799847, Train_acc 0.6445, Test_acc 0.6380
Epoch: 15, ela: 9.1 sec, Loss: 0.795130, Train_acc 0.6454, Test_acc 0.6471

However, when I normalize labels and train again I get this wired result showing 99.99% accuracy on both training and testing:

Epoch: 0, ela: 12.3 sec, Loss: 0.144049, Train_acc 0.9999, Test_acc 0.9999
Epoch: 1, ela: 12.7 sec, Loss: 0.023632, Train_acc 0.9999, Test_acc 0.9999
Epoch: 2, ela: 12.3 sec, Loss: 0.013996, Train_acc 0.9999, Test_acc 0.9999
Epoch: 3, ela: 12.7 sec, Loss: 0.010092, Train_acc 0.9999, Test_acc 0.9999
Epoch: 4, ela: 12.7 sec, Loss: 0.007964, Train_acc 0.9999, Test_acc 0.9999
Epoch: 5, ela: 12.6 sec, Loss: 0.006623, Train_acc 0.9999, Test_acc 0.9999
Epoch: 6, ela: 12.6 sec, Loss: 0.005700, Train_acc 0.9999, Test_acc 0.9999
Epoch: 7, ela: 12.4 sec, Loss: 0.005026, Train_acc 0.9999, Test_acc 0.9999
Epoch: 8, ela: 12.6 sec, Loss: 0.004512, Train_acc 0.9999, Test_acc 0.9999

How is this possible? Why normalizing labels affects training accuracy in such way?

The tutorial you linked to does multiclass classification. In multilabel classification, label for an example is a one-hot array. For example label [0 0 1 0] means this example belongs to class 2 (assuming classes start with 0). Normalizing this vector does not make sense because the values are already between 0 and 1. Also, in multiclass classification, only one of the label can be true and the other have to be false. Values other than 0 and 1 do not make sense in multi class classification.

When representing a batch of examples, it is common to write the labels as integers instead of on-hot arrays for easier readability. For example label [4 6 1 7] means the first example belongs to class 4, the second example belongs to class 6 and so on. Normalizing this representation also does not make sense because this representation is internally converted to one hot array.

Now, if you normalize the second representation, the behavior is undefined because floating points cannot be array indices. It is possible something weird is happening to give you the 99% accuracy. Maybe you normalized the values to 0 to 1 and the resulting one-hot arrays mostly points to class 0 and rarely class 1. That could give you a 99% accuracy.

I would suggest to not normalize the labels.