How to calculate accuracy for multilabel, multiclass classification

I am working on a multilabel, multiclass classification framework, and I want to add metrics for calculating multilabel and multiclass accuracy.

Here is the demo data:

predicted_labels = [[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[1,0,1,0,1]]
true_labels      = [[1,1,0,0,1],[1,0,0,1,1],[1,0,0,0,1],[1,1,1,0,1],[1,0,0,0,1],[1,0,0,0,1]]

The most popular accuracy metrics for multi-label, multi-class classification are:

Hamming score
Hamming loss
Subset accuracy

The code for the above three is:

import numpy as np
import sklearn.metrics

def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    '''
    Compute the Hamming score (a.k.a. label-based accuracy) for the multi-label case,
    and return it together with the subset accuracy and the Hamming loss.
    '''
    # Accept lists of lists as well as arrays; .shape below needs a NumPy array
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc_list = []
    for i in range(y_true.shape[0]):
        # Indices of the labels that are switched on in each row
        set_true = set(np.where(y_true[i])[0])
        set_pred = set(np.where(y_pred[i])[0])
        if len(set_true) == 0 and len(set_pred) == 0:
            # Both label sets are empty: count the row as a perfect match
            tmp_a = 1
        else:
            # Intersection over union of the true and predicted label sets
            tmp_a = len(set_true.intersection(set_pred)) / \
                    float(len(set_true.union(set_pred)))
        acc_list.append(tmp_a)

    return {'hamming_score': np.mean(acc_list),
            'subset_accuracy': sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None),
            'hamming_loss': sklearn.metrics.hamming_loss(y_true, y_pred)}
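As a cross-check on the function above, the same three quantities can probably be obtained directly from scikit-learn. This is a minimal sketch on the demo data, assuming a scikit-learn version that provides `jaccard_score`; the variable names are only illustrative. Averaging the per-row Jaccard index over samples matches the loop above as long as no row has both label sets empty (the loop scores that case as 1).

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score

# Demo data from the question, as 2-D binary indicator arrays
y_pred = np.array([[1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1],
                   [1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 1, 0, 1]])
y_true = np.array([[1, 1, 0, 0, 1], [1, 0, 0, 1, 1], [1, 0, 0, 0, 1],
                   [1, 1, 1, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1]])

# Per-row intersection-over-union of label sets, averaged over rows
hamming_score_value = jaccard_score(y_true, y_pred, average='samples')
# Fraction of rows whose whole label vector is predicted exactly
subset_accuracy = accuracy_score(y_true, y_pred)
# Fraction of individual label slots that are wrong
hamming_loss_value = hamming_loss(y_true, y_pred)

print(hamming_score_value, subset_accuracy, hamming_loss_value)
```

On this data the three values work out to 0.75, 2/6 and 5/30 respectively.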

However, I was looking for an F1 score for multilabel classification, so I tried sklearn's f1_score:

print(f1_score(true_labels, predicted_labels, average='micro'))

But it raised this error:

> ValueError: multiclass-multioutput is not supported

I converted the data into NumPy arrays and used f1_score again:

print(f1_score(np.array(true_labels),np.array(predicted_labels), average='micro'))

Then I get this score:

0.8275862068965517
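The value above can be reproduced by hand, which may help show what `average='micro'` does on a 2-D indicator array: it pools true positives, false positives and false negatives over every label slot before applying the F1 formula. A minimal sketch on the demo data (the names `tp`, `fp`, `fn` are only illustrative):

```python
import numpy as np

y_pred = np.array([[1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1],
                   [1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 1, 0, 1]])
y_true = np.array([[1, 1, 0, 0, 1], [1, 0, 0, 1, 1], [1, 0, 0, 0, 1],
                   [1, 1, 1, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1]])

# Pool the counts over all rows and all labels
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # label present and predicted
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # label predicted but absent
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # label present but missed

# F1 = 2*TP / (2*TP + FP + FN)
micro_f1 = 2 * tp / (2 * tp + fp + fn)
print(tp, fp, fn, micro_f1)
```

Here TP = 12, FP = 1 and FN = 4, so the score is 24/29 ≈ 0.8276.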

I tried one more experiment: I computed the F1 score for each pair of true and predicted label vectors individually and then took the mean of those scores:

# Named f1_scores so that it does not shadow sklearn's accuracy_score function
f1_scores = []

for tru, pred in zip(true_labels, predicted_labels):
    f1_scores.append(f1_score(tru, pred, average='micro'))

print(np.mean(f1_scores))

Output:

0.8333333333333335

The two scores are different.

Why does f1_score fail on a list of lists but work on a NumPy array? And which method is correct: scoring each example separately and taking the mean, or passing all samples at once as a NumPy array?
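One possible way to see where the two numbers come from: when a single row is passed to f1_score, it is a 1-D vector of 0s and 1s, so it is scored as a binary problem with classes 0 and 1, and micro-averaging over both classes reduces to plain accuracy for that row. The mean over rows then appears to be the fraction of correct label slots, that is, 1 minus the Hamming loss, rather than a multilabel F1. A small sketch to check this on the demo data (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

predicted_labels = [[1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1],
                    [1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 1, 0, 1]]
true_labels = [[1, 1, 0, 0, 1], [1, 0, 0, 1, 1], [1, 0, 0, 0, 1],
               [1, 1, 1, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1]]

# Row-by-row scoring: each call sees a 1-D binary vector
per_row = [f1_score(t, p, average='micro')
           for t, p in zip(true_labels, predicted_labels)]
mean_per_row = float(np.mean(per_row))

# Fraction of label slots predicted correctly over the whole matrix
slot_accuracy = 1 - hamming_loss(np.array(true_labels), np.array(predicted_labels))

print(mean_per_row, slot_accuracy)
```

Both values come out as 25/30 ≈ 0.8333, which would explain why the row-by-row mean differs from the micro F1 of 24/29 computed on the full arrays.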

What other metrics are available for evaluating multilabel classification?
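For reference, scikit-learn's f1_score (and likewise precision_score and recall_score) accepts several averaging modes on 2-D indicator arrays. The sketch below is only an illustration on the demo data, and it assumes a scikit-learn version that accepts the `zero_division` argument, used here to silence warnings for labels that are never predicted:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_pred = np.array([[1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1],
                   [1, 0, 0, 0, 1], [1, 0, 0, 0, 1], [1, 0, 1, 0, 1]])
y_true = np.array([[1, 1, 0, 0, 1], [1, 0, 0, 1, 1], [1, 0, 0, 0, 1],
                   [1, 1, 1, 0, 1], [1, 0, 0, 0, 1], [1, 0, 0, 0, 1]])

# micro: pool counts over all label slots, then compute F1 once
f1_micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
# macro: compute F1 per label (column), then take the unweighted mean
f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
# samples: compute F1 per example (row), then take the mean
f1_samples = f1_score(y_true, y_pred, average='samples', zero_division=0)

# Precision and recall take the same averaging options
precision_micro = precision_score(y_true, y_pred, average='micro', zero_division=0)
recall_micro = recall_score(y_true, y_pred, average='micro', zero_division=0)

print(f1_micro, f1_macro, f1_samples, precision_micro, recall_micro)
```

The macro score is low on this data because three of the five labels are never predicted correctly, which is the kind of per-label behavior that the micro average hides.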

You can check this answer and the other answers, where this has already been discussed.