How to do column-wise intersection with itertools

I am computing the Jaccard similarity between every pair of my m training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), to form an m × m similarity matrix.

The resulting matrix is not what I expect. I have identified the source of the problem but do not have an optimized solution for it.

A sample of the dataset is below:

ID      AGE     Occupation  Gender  Product_range  Product_cat  Product
1100    25-34   IT          M       50-60          Gaming       XPS 6610
1101    35-44   Research    M       60-70          Business     Latitude lat6
1102    35-44   Research    M       60-70          Performance  Inspiron 5810
1103    25-34   Lawyer      F       50-60          Business     Latitude lat5
1104    45-54   Business    F       40-50          Performance  Inspiron 5410

The matrix I get is

[image: similarity matrix with one value highlighted in a red box]

Problem Statement:

Look at the value in the red box, which is the similarity between rows 1104 and 1101 of the sample dataset. The two rows have nothing in common when compared column by column, yet the value is 0.16 because the term "Business" appears in the Occupation column of row 1104 and in the Product_cat column of row 1101, so the intersection of the two rows has size 1.

My code just takes the intersection of the two rows without regard to columns. How do I change it to handle this case while keeping the performance equally good?

My code:

import itertools

half_matrix = []
# data_set holds one set of values per row; tot_len is defined elsewhere
for row1, row2 in itertools.combinations(data_set, r=2):
    intersection = row1.intersection(row2)
    half_matrix.append(float(len(intersection)) / tot_len)

Tags: python-2.7 machine-learning cluster-analysis data-mining k-means

1 Answer

Deceive 欺骗

#2 · 2019-09-19 03:20

The simplest way out of this is to add a column-specific prefix to all entries. Example of a parsed row:

row = ["ID:1100", "AGE:25-34", "Occupation:IT", "Gender:M", "Product_range:50-60", "Product_cat:Gaming", "Product:XPS 6610"]
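As a rough sketch of how the prefixing could plug into the loop from the question: the `COLUMNS` list, the `rows` sample, and the helper `prefix_row` below are illustrative assumptions, not names from the original code. `tot_len` is assumed to be the number of features (6), which would be consistent with the 0.16 value in the question, and the ID is left out because it is unique per row and never contributes to an intersection.

```python
import itertools

# Assumed column order, taken from the sample table in the question.
COLUMNS = ["AGE", "Occupation", "Gender", "Product_range", "Product_cat", "Product"]

def prefix_row(values, columns=COLUMNS):
    """Tag every value with its column name so that equal strings in
    different columns no longer collide in a set intersection."""
    return set("%s:%s" % (col, val) for col, val in zip(columns, values))

rows = [
    ["25-34", "IT", "M", "50-60", "Gaming", "XPS 6610"],                  # 1100
    ["35-44", "Research", "M", "60-70", "Business", "Latitude lat6"],     # 1101
    ["35-44", "Research", "M", "60-70", "Performance", "Inspiron 5810"],  # 1102
    ["25-34", "Lawyer", "F", "50-60", "Business", "Latitude lat5"],       # 1103
    ["45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"],  # 1104
]

data_set = [prefix_row(r) for r in rows]
tot_len = len(COLUMNS)  # assumption: normalise by the number of features

half_matrix = []
for row1, row2 in itertools.combinations(data_set, r=2):
    half_matrix.append(float(len(row1 & row2)) / tot_len)
```

The prefixing is done once per row, so the pairwise loop is unchanged and should cost about the same as before. With this sample, the pair (1101, 1104) no longer shares "Business", since "Product_cat:Business" and "Occupation:Business" are different set elements.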

There are many other ways around this, including splitting each row into a set of k-mers and applying the Jaccard-based MinHash algorithm to compare these sets, but there is no need for anything like that in your case.
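One of the simpler alternatives, not spelled out above, is to skip sets entirely and compare the rows position by position. The sketch below assumes every row is a list with its values in the same column order; `positional_similarity` is a hypothetical helper name, and it normalises by the number of columns as the question's code appears to do.

```python
import itertools

def positional_similarity(row1, row2):
    """Fraction of columns in which the two rows hold the same value."""
    matches = sum(1 for a, b in zip(row1, row2) if a == b)
    return float(matches) / len(row1)

rows = [
    ["35-44", "Research", "M", "60-70", "Business", "Latitude lat6"],     # 1101
    ["35-44", "Research", "M", "60-70", "Performance", "Inspiron 5810"],  # 1102
    ["45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"],  # 1104
]

half_matrix = [positional_similarity(r1, r2)
               for r1, r2 in itertools.combinations(rows, r=2)]
```

This gives the same numbers as the prefixed-set version when each column holds a single value; the set-based version is likely the better fit if a column can hold several values.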
