Products Price Comparison Tool: Difficulty in matc

I'm building an e-commerce product price comparison tool in Python, somewhat similar to camelcamelcamel.com, both for fun and profit. I'm having difficulty matching identical items in the list I gather from various websites using a search term. I'm currently using cosine similarity, and I'm thinking of using the Levenshtein algorithm, to compare the item titles against each other and find the identical items.

For example, I have the following items and their prices:

{
    product_0: {
        title: "Apple MacBook Air MMGF2HN/A 13.3-inch Laptop (Core i5/8GB/128GB/Mac OS X/Integrated Graphics)",
        price: "xxxx",
    },
    product_1: {
        title: "Apple MacBook Air MMGF2HN/A 13.3-inch Laptop (Core i5/8GB/128GB/Mac OS X/Integrated Graphics) cover",
        price: "xyzy"
    },
    product_2: {
        title: "Apple Macbook Air MMGF2HNA Notebook (Intel Core i5- 8GB RAM- 128GB SSD- 33.78 cm(13.3)- OS X El Capitan) (Silver)",
        price: "xxyy"
    },
    product_3: {
        title: "....",
        price: "...."
    },

    ...

    product_99: {
        // product title and price
    }

}

When I run cosine similarity on the above list of items, the values are as follows:

cosine(product_0 * product_1) = 0.973328526785
cosine(product_0 * product_2) = 0.50251890763

But in reality product_0 and product_1 are two different items, yet their cosine similarity suggests they are identical; and product_0 and product_2 are the same product, yet their cosine similarity suggests they are two different items.
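A minimal sketch of how a token-count cosine similarity arrives at scores like these. The tokenizer (lowercase, runs of letters and digits) and the helper names `tokenize` and `cosine` are assumptions for illustration, not necessarily what produced the numbers above.

```python
import math
import re
from collections import Counter


def tokenize(title):
    # Assumed tokenizer: lowercase the title and keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", title.lower())


def cosine(title_a, title_b):
    # Bag-of-words cosine: dot product of token counts over the product of norms.
    a, b = Counter(tokenize(title_a)), Counter(tokenize(title_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


product_0 = ("Apple MacBook Air MMGF2HN/A 13.3-inch Laptop "
             "(Core i5/8GB/128GB/Mac OS X/Integrated Graphics)")
product_1 = product_0 + " cover"
product_2 = ("Apple Macbook Air MMGF2HNA Notebook (Intel Core i5- 8GB RAM- "
             "128GB SSD- 33.78 cm(13.3)- OS X El Capitan) (Silver)")

print(cosine(product_0, product_1))  # high: only one extra token
print(cosine(product_0, product_2))  # lower: many tokens differ in spelling
```

Appending a single token such as "cover" to a long title barely changes the count vector, so the score stays close to 1, while a reworded listing of the same product shares fewer exact tokens and scores much lower.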

I've been struggling to solve this problem on my own, so I thought I'd ask for suggestions here on Stack Overflow. Am I heading in the right direction by using cosine similarity to match items? If not, could you please point me in the right direction?

My basic idea is to do a price comparison on identical items, i.e., Semantic Analysis of various similar product items.

Thanks for your time.

Tags: python machine-learning nlp information-retrieval cosine-similarity

1 Answer

戒情不戒烟

#2 · 2019-04-16 23:37

You could train word2vec on the product titles. The resulting code would look something like this with the Python word2vec wrapper, and slightly different but similar with Gensim's gensim.models.word2vec:

indexes, metrics = model.cosine(normalized_phrase)
model.generate_response(indexes, metrics)

The generated response will be the title vectors sorted by descending cosine similarity.
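To make the ranking step concrete, here is a rough standard-library sketch of sorting titles by the cosine similarity of averaged word vectors. It does not train word2vec: the 3-dimensional vectors in `toy_vectors` are made up for illustration, and `title_vector` and `rank_titles` are hypothetical helpers, not part of either library mentioned above.

```python
import math

# Made-up toy vectors; in practice these would come from a word2vec model
# trained on the product titles.
toy_vectors = {
    "macbook": [0.9, 0.1, 0.0],
    "laptop": [0.8, 0.2, 0.1],
    "notebook": [0.8, 0.3, 0.1],
    "cover": [0.0, 0.1, 0.9],
    "sleeve": [0.1, 0.0, 0.9],
}


def title_vector(tokens, vectors):
    # Average the vectors of the tokens that have one; None if none do.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(v[i] for v in known) / len(known) for i in range(len(known[0]))]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_titles(query_tokens, candidates, vectors):
    # Return (name, score) pairs sorted by descending cosine similarity.
    query = title_vector(query_tokens, vectors)
    scored = []
    for name, tokens in candidates.items():
        vec = title_vector(tokens, vectors)
        if query is not None and vec is not None:
            scored.append((name, cosine(query, vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = {
    "sleeve listing": ["sleeve"],
    "accessory listing": ["macbook", "cover"],
    "notebook listing": ["macbook", "notebook"],
}
ranking = rank_titles(["macbook", "laptop"], candidates, toy_vectors)
for name, score in ranking:
    print(name, round(score, 3))
```

Because the vectors here are invented, the output only demonstrates the mechanics of the ranking; whether trained embeddings separate a laptop from its cover in real data would need to be checked on the actual titles.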
