I am calculating the Jaccard similarity between every pair of my m training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), to form an (m*m) similarity matrix.
The resulting matrix is not what I expect. I have identified the source of the problem but do not have an optimized solution for it.
A sample of the dataset is shown below:
ID    AGE    Occupation  Gender  Product_range  Product_cat  Product
1100  25-34  IT          M       50-60          Gaming       XPS 6610
1101  35-44  Research    M       60-70          Business     Latitude lat6
1102  35-44  Research    M       60-70          Performance  Inspiron 5810
1103  25-34  Lawyer      F       50-60          Business     Latitude lat5
1104  45-54  Business    F       40-50          Performance  Inspiron 5410
The matrix I get is
Problem Statement:
The value in the red box is the similarity between rows 1104 and 1101 of the sample dataset. The two rows have nothing in common when compared column by column. The value 0.16 arises because the term "Business" appears in the Occupation column of row 1104 and in the Product_cat column of row 1101, so the intersection of the two rows has size 1.
My code just takes the intersection of the two rows without regard to columns. How do I change my code to handle this case while keeping the performance equally good?
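The false match can be reproduced in isolation. This is only a minimal sketch, assuming each row is stored as a set of its six feature values (ID excluded) and that tot_len is the number of features, 6; neither assumption is confirmed by the code I posted.

```python
# Assumption: rows are sets of the six feature values, ID excluded.
row_1101 = {"35-44", "Research", "M", "60-70", "Business", "Latitude lat6"}
row_1104 = {"45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"}
tot_len = 6  # assumed: number of features per row

# A plain set intersection ignores which column a value came from,
# so "Business" (Product_cat in 1101, Occupation in 1104) counts as a match.
common = row_1101.intersection(row_1104)
similarity = len(common) / tot_len

print(common)       # {'Business'}
print(similarity)   # 1/6, the value shown as 0.16 in the matrix
```

Under these assumptions the single shared string gives 1/6, which matches the unexpected entry.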
My code:
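import itertools

half_matrix = []
# data_set holds one set of feature values per row; tot_len is defined elsewhere.
for row1, row2 in itertools.combinations(data_set, r=2):
    common = row1.intersection(row2)  # shared values, regardless of column
    half_matrix.append(float(len(common)) / tot_len)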
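One possible direction, sketched here under assumptions rather than offered as a definitive fix: tag every value with its column index before building the sets, so that "Business" in Occupation and "Business" in Product_cat become different elements. The names `rows` and the `(column_index, value)` encoding are hypothetical, and tot_len = 6 is assumed. The loop itself is unchanged, so the per-pair cost should stay comparable to the original.

```python
import itertools

# Hypothetical input: one tuple per row, features in a fixed column order
# (Age, Occupation, Gender, Product_range, Product_cat, Product), ID excluded.
rows = [
    ("25-34", "IT", "M", "50-60", "Gaming", "XPS 6610"),            # 1100
    ("35-44", "Research", "M", "60-70", "Business", "Latitude lat6"),  # 1101
    ("35-44", "Research", "M", "60-70", "Performance", "Inspiron 5810"),  # 1102
    ("25-34", "Lawyer", "F", "50-60", "Business", "Latitude lat5"),   # 1103
    ("45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"),  # 1104
]
tot_len = 6  # assumed: number of features per row

# Pair each value with its column index, e.g. (1, "Business") vs (4, "Business"),
# so equal strings in different columns no longer intersect.
data_set = [frozenset(enumerate(row)) for row in rows]

half_matrix = []
for row1, row2 in itertools.combinations(data_set, r=2):
    common = row1 & row2  # column-aware intersection
    half_matrix.append(len(common) / tot_len)

# itertools.combinations yields pairs in index order, so the pair
# (1101, 1104) is at position 6 of the half matrix.
print(half_matrix[6])
```

With this encoding the pair (1101, 1104) shares no elements, while rows 1101 and 1102, which agree on four columns, score 4/6. The tagging is done once per row up front, not once per pair.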