I have a collection of clothing and accessory products, each represented by a Python object with various attributes. The products are built by querying an external API and scraping the merchant websites.
My goal is to develop a classifier that uses these attributes to correctly categorise the products (e.g. as trousers, t-shirts, dresses etc.).
I have both a training set and a test set. Each is a subset of the entire data set, selected uniformly at random and manually categorised.
I spoke to an ex-university colleague of mine who specialises in machine learning and he suggested using a decision tree. However, the decision tree libraries in Python appear to be very numerically focused (rather than focused on classifying data based on textual attributes).
I am aware of libraries like scikit-learn, but from my brief analysis they appear to support simpler logic for the rules than I require.
Any suggestions on approach, library, code structure etc. would be greatly appreciated. However, the main focus of this question is which Python machine learning library (if any) would be most appropriate for this task.
The product attributes include the following:
- name (str)
- description (str)
- available_sizes ([str, str...])
- available_colours ([str, str...])
- price (float)
- url (str)
- category_name (str)
- images ([str, str...] - URLs)
An example of a product:
{ 'category': u"Men's Accessories",
  'colours': [u'White'],
  'description': u'Keep your formal style looking classic with this white short sleeve Oxford shirt with roll up sleeve detailing.',
  'ean': u'',
  'gender': u'M',
  'images': [ u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_large.jpg',
              u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_2_large.jpg',
              u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_3_large.jpg',
              u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_4_large.jpg'],
  'last_scraped': datetime.datetime(2014, 11, 1, 7, 13, 28, 943000),
  'merchant_id': 2479L,
  'merchant_uri': u'http://www.topman.com/en/tmuk/product/white-oxford-short-sleeve-shirt-157702?geoip=noredirect',
  'name': u'White Oxford Short Sleeve Shirt',
  'price': 26.0,
  'sizes': [u'XXS', u'XS', u'S', u'M', u'L', u'XL', u'XXL']}
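To make the "numerically focused" concern concrete, here is a minimal sketch of one way the textual attributes of such a product might be flattened into numeric features. This is only an illustration under my own assumptions, not a settled design: the names `TEXT_FIELDS`, `LIST_FIELDS`, `tokenize` and `product_to_features` are hypothetical, and the choice of simple word counts is just one possible encoding. A flat dict of this shape is the kind of input that, as far as I understand, a tool such as scikit-learn's `DictVectorizer` could turn into a numeric matrix.

```python
import re
from collections import Counter

# Hypothetical configuration: which fields of the example product to use.
TEXT_FIELDS = ("name", "description")   # free text, split into words
LIST_FIELDS = ("colours", "sizes")      # lists of short strings


def tokenize(text):
    """Lower-case a string and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def product_to_features(product):
    """Map one product dict to a flat {feature_name: number} dict."""
    features = Counter()
    # Word counts per text field, e.g. "name=shirt" -> 1.
    for field in TEXT_FIELDS:
        for token in tokenize(product.get(field, "")):
            features[field + "=" + token] += 1
    # Presence flags for list-valued fields, e.g. "colours=white" -> 1.
    for field in LIST_FIELDS:
        for value in product.get(field, []):
            features[field + "=" + value.lower()] = 1
    # Numeric attributes are passed through unchanged.
    features["price"] = product.get("price", 0.0)
    return dict(features)


# A subset of the example product shown above.
example = {
    "name": "White Oxford Short Sleeve Shirt",
    "description": "Keep your formal style looking classic with this white "
                   "short sleeve Oxford shirt with roll up sleeve detailing.",
    "colours": ["White"],
    "sizes": ["XXS", "XS", "S", "M", "L", "XL", "XXL"],
    "price": 26.0,
}
features = product_to_features(example)
```

Each training product would be converted this way and paired with its manually assigned category; whether word counts are expressive enough for the rules I need is exactly the part I am unsure about.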