I have data from two websites and I want to analyze it.
Each product name I have is a brand followed by the product name (Brand + Product name), and I want to extract only the brand name.
http://www.thehut.com/jeans-clothing/men/clothing/brave-soul-men-s-cardiff-jeans-denim/10741907.html
On the page above, the product name is
Brave Soul Men's Swansea Jeans - Denim
Brand name is
Brave Soul
So I want only
Brave Soul
Amazon link:
http://www.amazon.in/gp/product/B00L8WT2UI
Similarly, on the page above, the product name is
Apple iPhone 5c (White, 8GB)
Brand name is
Apple
So I want output like
Brave Soul
Apple
The information you're trying to get isn't actually there.
If you take two strings, both of which may have any number of spaces, and join them together with a space, it's no longer possible to tell unambiguously which space was joining the two strings, and which spaces were part of the strings.
So, you have a few options:
First, there aren't that many spaces in each product, so you can just try all the possibilities: Brand "Brave" and Product "Soul Men's Swansea Jeans - Denim", then Brand "Brave Soul" and Product "Men's Swansea Jeans - Denim", then Brand "Brave Soul Men's" and Product "Swansea Jeans - Denim", and so on for the other 3 possibilities.
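Enumerating the possibilities is only a few lines of Python. This is a minimal sketch: the function name `candidate_splits` is made up for illustration, and it assumes words are separated by plain whitespace.

```python
def candidate_splits(title):
    """Yield every (brand, product) pair made by cutting the title at one space."""
    words = title.split()
    # Cut after word 1, after word 2, ... but always leave at least one product word.
    for i in range(1, len(words)):
        yield " ".join(words[:i]), " ".join(words[i:])


if __name__ == "__main__":
    for brand, product in candidate_splits("Brave Soul Men's Swansea Jeans - Denim"):
        print(repr(brand), "|", repr(product))
```

The title has six spaces, so this produces six candidate pairs, each of which you would then have to check somehow.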
Second, if you can scrape a list of all brand names from somewhere else and stash them in a set (or a database table or whatever), you can pre-filter the possibilities before trying them all in comparatively slow web requests to Amazon. For example, if you have a list of all the brands, just check which among "Brave", "Brave Soul", "Brave Soul Men's", "Brave Soul Men's Swansea", etc. are actual brands, and only test those.
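The pre-filter could look roughly like this. The contents of `KNOWN_BRANDS` are a hypothetical stand-in for whatever list you manage to scrape, and `brand_candidates` is an illustrative name, not an existing API.

```python
# Hypothetical brand list; in practice this would be scraped or loaded from a database.
KNOWN_BRANDS = {"Brave Soul", "Apple", "Apple Records"}


def brand_candidates(title, known_brands):
    """Return every leading run of words that is a known brand, shortest first."""
    words = title.split()
    prefixes = (" ".join(words[:i]) for i in range(1, len(words)))
    # Set membership is cheap, so this runs long before any slow web request.
    return [prefix for prefix in prefixes if prefix in known_brands]


if __name__ == "__main__":
    print(brand_candidates("Brave Soul Men's Swansea Jeans - Denim", KNOWN_BRANDS))
    print(brand_candidates("Apple iPhone 5c (White, 8GB)", KNOWN_BRANDS))
```

Only the prefixes that survive this filter need the expensive check. Note that the comparison here is exact and case-sensitive; real data would probably need some normalization first.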
Meanwhile, this still isn't going to be perfect, because there are almost certainly cases that are ambiguous. For example, there's a brand "Apple", and also a brand "Apple Records", so what happens when you try to split up "Apple Records Master Collection"? You've got two valid possibilities, not just one. All you can do is design your code to deal with that in some way (and unit test that you did so correctly).
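One way to make that decision explicit, as a sketch only: fail loudly on ambiguity by default, and let the caller opt in to a "longest brand wins" policy. `AmbiguousBrand`, `extract_brand` and `prefer_longest` are names invented for this example, and the longest-match policy is just one possible choice, not necessarily the right one for your data.

```python
class AmbiguousBrand(Exception):
    """Raised when more than one known brand is a prefix of the title."""


def extract_brand(title, known_brands, prefer_longest=False):
    """Return the brand prefix of title, None if no brand matches."""
    words = title.split()
    prefixes = (" ".join(words[:i]) for i in range(1, len(words)))
    matches = [prefix for prefix in prefixes if prefix in known_brands]
    if not matches:
        return None
    if len(matches) == 1 or prefer_longest:
        return matches[-1]  # prefixes are generated shortest first
    raise AmbiguousBrand(matches)


if __name__ == "__main__":
    brands = {"Apple", "Apple Records"}
    print(extract_brand("Apple iPhone 5c (White, 8GB)", brands))
    print(extract_brand("Apple Records Master Collection", brands, prefer_longest=True))
```

Whatever policy you pick, the ambiguous titles are exactly the cases worth covering in unit tests.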
I will not try to suggest concrete algorithms here, but I would like to point out that even in an unsupervised setting, given a sufficiently large database of such strings, certain patterns can be discovered (i.e. the information actually is there). First, brand names repeat across multiple strings and are always a prefix of the string. That assumption gives a strong prior on where the separation can be. Second, the ambiguous cases (Apple Records) could be attacked by learning an association between candidate brand hypotheses and product names (e.g. as long as Apple does not produce records, and there is a strong hypothesis that Apple Records is a brand because it repeats multiple times, the ambiguity can be resolved successfully).
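Purely as a toy illustration of the first observation, and not a recommended algorithm: count how often each leading run of words occurs across all titles, then take the longest prefix that is still as frequent as the first word alone. The function names, the `min_support` threshold and the sample titles are all made up for this sketch.

```python
from collections import Counter


def prefix_counts(titles):
    """Count how many titles start with each leading run of words."""
    counts = Counter()
    for title in titles:
        words = title.split()
        for i in range(1, len(words)):
            counts[" ".join(words[:i])] += 1
    return counts


def guess_brand(title, counts, min_support=2):
    """Guess the longest prefix that is as frequent as the title's first word."""
    words = title.split()
    if len(words) < 2 or counts[words[0]] < min_support:
        return None  # not enough evidence to guess anything
    best = words[0]
    for i in range(2, len(words)):
        prefix = " ".join(words[:i])
        if counts[prefix] < counts[words[0]]:
            break  # frequency drops: the product name has probably started
        best = prefix
    return best


if __name__ == "__main__":
    titles = [  # invented sample data
        "Brave Soul Men's Swansea Jeans - Denim",
        "Brave Soul Men's Cardiff Jeans - Denim",
        "Brave Soul Women's Parka - Black",
        "Apple iPhone 5c (White, 8GB)",
        "Apple iPad Mini (Silver, 16GB)",
    ]
    counts = prefix_counts(titles)
    for title in titles:
        print(guess_brand(title, counts), "<-", title)
```

This heuristic breaks easily. If every Apple title in the data happens to be an iPhone, it will guess "Apple iPhone", and it cannot separate Apple from Apple Records on its own. That is where the second idea, associating brand hypotheses with product names, would have to come in.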