
Question:

I have data from two websites, and I want to analyze it.

Each product name I have is a brand followed by the product's name (Brand + Product name), and I want to extract only the brand.

http://www.thehut.com/jeans-clothing/men/clothing/brave-soul-men-s-cardiff-jeans-denim/10741907.html

On the page above, the product name is

Brave Soul Men's Swansea Jeans - Denim

Brand name is

Brave Soul

So I want only

Brave Soul

Amazon link

http://www.amazon.in/gp/product/B00L8WT2UI

Similarly, on the page above, the product name is

Apple iPhone 5c (White, 8GB)

Brand name is

Apple

So I want output like

Brave Soul
Apple

Answer 1:

The information you're trying to get isn't actually there.

If you take two strings, both of which may have any number of spaces, and join them together with a space, it's no longer possible to tell unambiguously which space was joining the two strings, and which spaces were part of the strings.

So, you have a few options:

First, there aren't that many spaces in each product name, so you can just try all the possibilities: Brand Brave and Product Soul Men's Swansea Jeans - Denim, then Brand Brave Soul and Product Men's Swansea Jeans - Denim, then Brand Brave Soul Men's and Product Swansea Jeans - Denim, and so on for the other 3 possibilities.
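
Enumerating those possibilities could look something like the sketch below. The function name candidate_splits is made up for illustration, and the sketch assumes the title is split on whitespace only.

```python
def candidate_splits(title):
    """Yield every (brand, product) pair made by cutting the title at one space."""
    words = title.split()
    # Cut after word 1, after word 2, ... but never leave either side empty.
    for i in range(1, len(words)):
        yield " ".join(words[:i]), " ".join(words[i:])


splits = list(candidate_splits("Brave Soul Men's Swansea Jeans - Denim"))
for brand, product in splits:
    print(f"Brand {brand!r} / Product {product!r}")
```

The title has six spaces, so this yields six candidate pairs; each one would then have to be checked against some outside source to see which brand is real.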

Second, if you can scrape a list of all brand names from somewhere else and stash them in a set (or a database table or whatever), you can pre-filter the possibilities before trying them all in comparatively slow web requests to Amazon. For example, if you have a list of all the brands, just check which among Brave, Brave Soul, Brave Soul Men's, Brave Soul Men's Swansea, etc. are actual brands, and only test those.
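
A rough sketch of that pre-filter, assuming you already have the brand list in a Python set. KNOWN_BRANDS here is a tiny hypothetical sample, not real scraped data, and the comparison is an exact, case-sensitive match.

```python
# Hypothetical sample; in practice this would be scraped or loaded from a table.
KNOWN_BRANDS = {"Brave Soul", "Apple", "Apple Records"}


def brand_candidates(title, known_brands):
    """Return every leading run of words in the title that is a known brand."""
    words = title.split()
    prefixes = (" ".join(words[:i]) for i in range(1, len(words)))
    return [prefix for prefix in prefixes if prefix in known_brands]


print(brand_candidates("Brave Soul Men's Swansea Jeans - Denim", KNOWN_BRANDS))
print(brand_candidates("Apple iPhone 5c (White, 8GB)", KNOWN_BRANDS))
```

Only the prefixes that survive this filter need to go through the slow web requests.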

Meanwhile, this still isn't going to be perfect, because there are almost certainly cases that are ambiguous. For example, there's a brand Apple, and also a brand Apple Records, so what happens when you try to split up Apple Records Master Collection? You've got two valid possibilities, not just one. All you can do is design your code to deal with that in some way (and unit test that you did so correctly).
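
One possible way to deal with it is sketched below. It assumes a longest-match policy and returns a flag so the caller can see that the answer was ambiguous. That policy is just one choice among several (you could equally return all matches), and resolve_brand and the sample brand set are hypothetical.

```python
def resolve_brand(title, known_brands):
    """Return (brand, is_ambiguous), preferring the longest matching prefix."""
    words = title.split()
    # Prefixes are generated shortest first, so the last match is the longest.
    matches = [
        " ".join(words[:i])
        for i in range(1, len(words))
        if " ".join(words[:i]) in known_brands
    ]
    if not matches:
        return None, False
    return matches[-1], len(matches) > 1


brands = {"Brave Soul", "Apple", "Apple Records"}  # hypothetical sample

# The kind of checks a unit test for the ambiguous case might contain.
assert resolve_brand("Apple Records Master Collection", brands) == ("Apple Records", True)
assert resolve_brand("Apple iPhone 5c (White, 8GB)", brands) == ("Apple", False)
assert resolve_brand("Some Unknown Product", brands) == (None, False)
```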

Answer 2:

I won't try to suggest concrete algorithms here, but I'd like to point out that even in the unsupervised case, given a sufficiently large database of such strings, certain patterns can be discovered (i.e. the information actually is there). First, brand names repeat across multiple strings and are always a prefix of the string. That assumption gives a strong prior on where the separation can be. Second, the ambiguous cases (Apple Records) could be attacked by learning an association between candidate brands and product names (e.g. as long as Apple does not produce records, and there is strong evidence that Apple Records is a brand because it repeats multiple times, the ambiguity can be resolved successfully).
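
Purely as an illustration of the first observation (repeated prefixes), and not as a real algorithm, a toy version might count how many titles share each leading run of words and take the longest one that repeats. The function names, the min_count threshold, and the two extra titles are all made up for the example.

```python
from collections import Counter


def prefix_counts(titles):
    """Count how many titles start with each word-level prefix."""
    counts = Counter()
    for title in titles:
        words = title.split()
        for i in range(1, len(words)):
            counts[" ".join(words[:i])] += 1
    return counts


def guess_brand(title, counts, min_count=2):
    """Return the longest prefix seen in at least min_count titles, else None."""
    words = title.split()
    best = None
    for i in range(1, len(words)):
        prefix = " ".join(words[:i])
        if counts[prefix] >= min_count:
            best = prefix  # later prefixes are longer, so keep overwriting
    return best


titles = [
    "Brave Soul Men's Swansea Jeans - Denim",
    "Brave Soul Women's Parka Jacket - Black",  # hypothetical title
    "Apple iPhone 5c (White, 8GB)",
    "Apple iPad Mini (Silver, 16GB)",  # hypothetical title
]
counts = prefix_counts(titles)
print([guess_brand(t, counts) for t in titles])
```

On a real dataset this would overshoot whenever many titles share more than the brand (for example several "Brave Soul Men's" products), which is where the second observation about brand and product associations would have to come in.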