How to do column wise intersection with itertools

Published 2019-09-19 02:17

Question:

I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), forming an (m*m) similarity matrix.

Some of the matrix values are not what I expect. I have identified the source of the problem but do not possess an optimized solution for it.

Find the sample of the dataset below:

ID      AGE     Occupation  Gender  Product_range  Product_cat  Product
1100    25-34   IT          M       50-60          Gaming       XPS 6610
1101    35-44   Research    M       60-70          Business     Latitude lat6
1102    35-44   Research    M       60-70          Performance  Inspiron 5810
1103    25-34   Lawyer      F       50-60          Business     Latitude lat5
1104    45-54   Business    F       40-50          Performance  Inspiron 5410

The matrix I get is shown below [matrix screenshot; the entry for rows 1101 and 1104, value 0.16, is marked with a red box].

Problem Statement:

Look at the value in the red box, which shows the similarity between rows 1104 and 1101 of the sample dataset. The two rows are not similar if you compare them column by column; however, the value 0.16 appears because the term "Business" is present in the "Occupation" column of row 1104 and the "Product_cat" column of row 1101, which contributes 1 to the intersection when the rows are intersected as plain sets.

My code just takes the intersection of the two rows without looking at the columns. How do I change my code to handle this case while keeping the performance equally good?

My code:

import itertools

half_matrix = []
for row1, row2 in itertools.combinations(data_set, r=2):
    # Set intersection ignores which column each value came from
    intersection = row1.intersection(row2)
    half_matrix.append(float(len(intersection)) / tot_len)
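To see the cross-column collision concretely, here is a minimal sketch using rows 1101 and 1104 hard-coded from the sample dataset (`tot_len` is assumed to be the number of feature columns, 6):

```python
# Rows 1101 and 1104 from the sample data, as plain value sets
row_1101 = {"35-44", "Research", "M", "60-70", "Business", "Latitude lat6"}
row_1104 = {"45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"}

tot_len = 6  # number of feature columns

# "Business" sits in Product_cat of 1101 but in Occupation of 1104,
# yet a plain set intersection still counts it as a match
shared = row_1101 & row_1104          # {'Business'}
similarity = len(shared) / tot_len    # 1/6 — the spurious 0.16 in the matrix
```

No column of 1101 equals the corresponding column of 1104, yet the similarity comes out non-zero: exactly the red-box value.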

Answer 1:

The simplest way out of this is to add a column-specific prefix to all entries. Example of a parsed row:

row = ["ID:1100", "AGE:25-34", "Occupation:IT", "Gender:M", "Product_range:50-60", "Product_cat:Gaming", "Product:XPS 6610"]

There are many other ways around this, including splitting each row into a set of k-mers and applying the Jaccard-based MinHash algorithm to compare those sets, but there is no need for that in your case.