Python decision tree classification of complex obj

Posted 2019-08-26 00:29

I have a collection of clothing / accessory products (represented by a Python object) with various attributes. These products are generated by a combination of querying an external API and scraping the merchant websites to obtain various attributes.

My goal is to develop a classifier that uses these attributes to correctly categorise the products (i.e. into categories such as trousers, t-shirts, dresses etc.).

I have both a training set and a test set. Each is a subset of the entire data set, selected uniformly at random and manually categorised.

I spoke to an ex-university colleague of mine who specialises in machine learning and he suggested using a decision tree. However, the decision tree libraries in Python appear to be very numerically focused (rather than focused on classifying data based on textual attributes).

I am aware of libraries like scikit-learn, but from my brief analysis it appears that they generally involve simpler logic for the rules than I require.

Any suggestions on approach, library, code structure etc. would be greatly appreciated. However, the main focus of this question is which Python machine learning library (if any) would be most appropriate for this task.

The product attributes include the following:

  • name (str)
  • description (str)
  • available_sizes ([str, str...])
  • available_colours ([str, str...])
  • price (float)
  • url (str)
  • category_name (str)
  • images ([str, str...] - urls)

An example of a product:

{   'category': u"Men's Accessories",
    'colours': [u'White'],
    'description': u'Keep your formal style looking classic with this white short sleeve Oxford shirt with roll up sleeve detailing.',
    'ean': u'',
    'gender': u'M',
    'images': [   u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_large.jpg',
                  u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_2_large.jpg',
                  u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_3_large.jpg',
                  u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_4_large.jpg'],
    'last_scraped': datetime.datetime(2014, 11, 1, 7, 13, 28, 943000),
    'merchant_id': 2479L,
    'merchant_uri': u'http://www.topman.com/en/tmuk/product/white-oxford-short-sleeve-shirt-157702?geoip=noredirect',
    'name': u'White Oxford Short Sleeve Shirt',
    'price': 26.0,
    'sizes': [u'XXS', u'XS', u'S', u'M', u'L', u'XL', u'XXL']}

1 answer
Fickle 薄情
#2 · 2019-08-26 01:02

You can use scikit-learn, but you need to preprocess your data first. Other decision tree implementations can deal with categorical data directly, but that will not solve your problem: you still need to preprocess the data.

First, I would leave out the images, as using them is somewhat complex. All the other variables need to be encoded in a way that is sensible for machine learning. For example, the available sizes could be encoded as one 0/1 indicator per size, depending on whether that size is available. The colors could be encoded as a categorical variable if they come from a fixed set of strings. If this is a free-text field, a categorical variable might not work well: for example, people might use both "gray" and "grey", which would be treated as two completely unrelated values, or the field might contain typos.
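As a minimal sketch of the size encoding described above, scikit-learn's MultiLabelBinarizer turns each product's list of sizes into a row of 0/1 indicators. The product dicts, the COLOUR_ALIASES table and the normalise_colour helper are made up for illustration; a real alias table would have to be built from your own data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical products; only the fields used below are included.
products = [
    {"name": "White Oxford Short Sleeve Shirt", "sizes": ["S", "M", "L"]},
    {"name": "Slim Chinos", "sizes": ["M", "L", "XL"]},
    {"name": "Leather Belt", "sizes": []},
]

# One column per size seen in the data; 1 if the product offers that size.
binarizer = MultiLabelBinarizer()
size_matrix = binarizer.fit_transform([p["sizes"] for p in products])
size_columns = list(binarizer.classes_)

# Assumed alias table for free-text colours (illustrative, not exhaustive).
COLOUR_ALIASES = {"gray": "grey"}

def normalise_colour(raw):
    """Lower-case and strip a colour string, then map known spelling variants."""
    key = raw.strip().lower()
    return COLOUR_ALIASES.get(key, key)

print(size_columns)
print(size_matrix)
print(normalise_colour(" Gray "))
```

Normalising the colour strings before encoding them reduces the "gray" versus "grey" problem, but it will not catch arbitrary typos.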

The descriptions and names are probably unique to each product, so using categorical variables there doesn't make sense, as each value will only be seen once. For these it would probably be best to use a bag-of-words encoding.
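A bag-of-words encoding could look roughly like this with scikit-learn's CountVectorizer. The three product names are invented examples; in practice you might concatenate each product's name and description into one string first.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical product names standing in for name/description text.
texts = [
    "White Oxford Short Sleeve Shirt",
    "Blue Slim Fit Shirt",
    "Black Skinny Jeans",
]

# Each distinct (lower-cased) word becomes one column; cells hold word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # sparse matrix, one row per product
vocab = vectorizer.vocabulary_       # maps word -> column index

print(X.shape)
print(sorted(vocab))
```

The result is a sparse matrix, which is why the number of columns can grow into the tens of thousands without becoming unmanageable.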

You can find a tutorial on text classification in the tutorials section of the scikit-learn documentation. You might want to have a look at the other tutorials, too.

Finally, I would suggest starting with a linear classifier, such as Naive Bayes or LinearSVC. Single trees are mostly useful if you want to extract the actual rules, and as far as I know they are rarely used in text processing (there are often tens or hundreds of thousands of features, one per word, so extracting meaningful rules is hard). If you want to use a tree-based method, an ensemble such as a random forest or gradient boosting will most likely yield better results.
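To make the suggestion concrete, here is a small sketch that chains a TF-IDF bag-of-words step with LinearSVC in a scikit-learn Pipeline. The training texts and category labels are toy data invented for this example, far too small to say anything about real accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (hypothetical); real data would be the manually
# categorised training set described in the question.
train_texts = [
    "white oxford short sleeve shirt",
    "blue slim fit shirt",
    "checked flannel shirt",
    "black skinny jeans",
    "blue straight leg jeans",
    "ripped denim jeans",
]
train_labels = ["shirts", "shirts", "shirts", "jeans", "jeans", "jeans"]

# Vectorise the text, then fit a linear classifier on the word features.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(train_texts, train_labels)

predictions = model.predict(["striped shirt", "stretch jeans"])
print(list(predictions))
```

To try a tree ensemble instead, the last pipeline step could be swapped for RandomForestClassifier from sklearn.ensemble while leaving the rest unchanged.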
