Feature selection for named entity recognition using SVM

Posted 2020-03-26 02:35

I have some user comment data from which I want to extract the names of consumer electronics brands. For instance, consider these example sentences run through nltk.ne_chunk, which mention "PS4", "nokia 720 lumia", "apple ipad", and "sony bravia":

In [52]: nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize('When is the PS4 releasing')))
Out[52]: Tree('S', [('When', 'WRB'), ('is', 'VBZ'), ('the', 'DT'), Tree('ORGANIZATION', [('PS4', 'NNP')]), ('releasing', 'NN')])

In [53]: nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize('I couldnt find the nokia 720 lumia in stores')))
Out[53]: Tree('S', [('I', 'PRP'), ('couldnt', 'VBP'), ('find', 'JJ'), ('the', 'DT'), ('nokia', 'NN'), ('720', 'CD'), ('lumia', 'NN'), ('in', 'IN'), ('stores', 'NNS')])

In [54]: nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize('I just bought apple ipad and its really awesome')))
Out[54]: Tree('S', [('I', 'PRP'), ('just', 'RB'), ('bought', 'VBD'), ('apple', 'JJ'), ('ipad', 'NN'), ('and', 'CC'), ('its', 'PRP$'), ('really', 'RB'), ('awesome', 'JJ')])

In [55]: nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize('I would like to buy 1 Sony bravia led television')))
Out[55]: Tree('S', [('I', 'PRP'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'), ('buy', 'VB'), ('1', 'CD'), ('Sony', 'NNP'), ('bravia', 'IN'), ('led', 'VBN'), ('television', 'NN')])  

The problem is how to represent the data so that the SVM can learn from it. I have read dozens of research papers, but none of them disclose how they represented the feature data for the SVM. Can anybody please help?

1 answer
别忘想泡老子
#2 · 2020-03-26 02:44

What I would do is put all the electronics brands you care about in a list and then, so that each entry gets a unique representation, use the entry's index in the list as a feature.

e.g. ['Nokia', 'Apple', 'Microsoft']

then: Nokia => 0, Apple => 1, Microsoft => 2, etc.

This gives each brand a unique representation, which can then serve as one feature for the SVM, among others, I presume.
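As a rough sketch of how the brand-list idea could be combined with a few other per-token features, something like the following might work. The names BRANDS, BRAND_INDEX and token_features are made up for illustration, the choice of extra features (POS tag, capitalisation, digits, neighbouring tags) is only an assumption about what could be useful, and the input is assumed to be a list of (word, tag) pairs like the ones nltk.pos_tag returns.

```python
# Hypothetical brand list (a small gazetteer); extend it as needed.
BRANDS = ['Nokia', 'Apple', 'Microsoft']

# Map each lower-cased brand name to its index in the list.
BRAND_INDEX = {brand.lower(): i for i, brand in enumerate(BRANDS)}


def token_features(tagged_tokens, position):
    """Build a feature dict for the token at `position`.

    `tagged_tokens` is a list of (word, pos_tag) pairs.
    """
    word, tag = tagged_tokens[position]
    lowered = word.lower()
    features = {
        'word.lower': lowered,
        'pos': tag,
        'is_title': word.istitle(),
        'has_digit': any(ch.isdigit() for ch in word),
        # Index in the brand list; -1 means "not in the list".
        'brand_index': BRAND_INDEX.get(lowered, -1),
        'in_brand_list': lowered in BRAND_INDEX,
    }
    # Context features: POS tags of the neighbouring tokens,
    # with sentinel values at the sentence boundaries.
    if position > 0:
        features['prev_pos'] = tagged_tokens[position - 1][1]
    else:
        features['prev_pos'] = '<START>'
    if position < len(tagged_tokens) - 1:
        features['next_pos'] = tagged_tokens[position + 1][1]
    else:
        features['next_pos'] = '<END>'
    return features


# Tagged tokens taken from the 'apple ipad' example in the question.
tagged = [('I', 'PRP'), ('just', 'RB'), ('bought', 'VBD'),
          ('apple', 'JJ'), ('ipad', 'NN')]
rows = [token_features(tagged, i) for i in range(len(tagged))]
```

Each dict in rows describes one token. Dicts like these can be turned into numeric vectors (for example with scikit-learn's DictVectorizer, which one-hot encodes string values) and paired with a per-token label before training an SVM. Note that a raw integer index implies an ordering between brands that does not really exist, so the boolean in_brand_list flag or a one-hot encoding of the brand may be the safer feature.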
