Categorizing Words and Category Values

Posted 2019-03-09 02:29

We were set an algorithm problem in class today, framed as "if you figure out a solution, you don't have to do this subject". So of course, we all thought we would give it a go.

Basically, we were provided with a DB of 100 words and 10 categories. No mapping is given between the words and the categories, so it's just a list of 100 words and a list of 10 categories.

We have to place each word into the correct category algorithmically. That is, the program must somehow "understand" the word and then put it in the most appropriate category.

For example, one of the words is "fishing" and one of the categories is "sport", so "fishing" would go into "sport". There is some overlap, such that some words could go into more than one category.

If we figure it out, we then have to increase the sample size, and the person with the "best" matching percentage wins.

Does anyone have ANY idea how to start something like this? Or any resources? Preferably in C#?

Even a keyword DB or something similar might be helpful. Does anyone know of any free ones?

21 Answers
兄弟一词,经得起流年.
#2 · 2019-03-09 03:11

My naive approach:

  1. Create a huge text file like this (read the article for inspiration)
  2. For every word, scan the text and, whenever you match that word, count the categories that appear within N positions (the maximum distance, i.e. the radius) to the left and right of it.
  3. The word is likely to belong in the category with the greatest counter.
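
The counting steps above can be sketched roughly as follows. This is only a minimal illustration in Python (not the C# the question asks for); the function name `categorize`, the whitespace tokenization, and the tiny corpus are all assumptions made for the example, not part of the original answer.

```python
from collections import Counter


def categorize(word, categories, corpus_text, radius=5):
    """Return (best_category, counts) for `word`.

    Counts how often each category name appears within `radius`
    tokens to the left or right of every occurrence of `word`.
    Tokenization is a plain whitespace split (an assumption).
    """
    tokens = corpus_text.lower().split()
    category_set = {c.lower() for c in categories}
    target = word.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Tokens inside the window, excluding the word itself.
        window = tokens[max(0, i - radius):i] + tokens[i + 1:i + 1 + radius]
        counts.update(t for t in window if t in category_set)
    if not counts:
        return None, counts  # word never seen near any category
    return counts.most_common(1)[0][0], counts


# Tiny made-up corpus, for illustration only.
corpus = (
    "fishing is a popular sport in many countries . "
    "some people think fishing is more a hobby than a sport . "
    "fresh food from fishing trips tastes great"
)

best, counts = categorize("fishing", ["sport", "food"], corpus, radius=5)
print(best, dict(counts))
```

Because the full counter is returned, a word that overlaps several categories can be detected by checking whether the runner-up count is close to the winner.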
放荡不羁爱自由
#3 · 2019-03-09 03:13

Use WordNet (either online or downloaded), and count the number of relationships you have to follow to get from each word to each category. The category reached in the fewest steps is the most likely match.
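
As a rough sketch of the "count the relationships" idea, the snippet below runs a breadth-first search over a small hand-made relation graph. The `EDGES` list is a hypothetical stand-in for real WordNet links (it is not WordNet data), and `build_graph`, `hops`, and `closest_category` are names invented for this example.

```python
from collections import deque

# Hypothetical relation edges standing in for WordNet links.
EDGES = [
    ("fishing", "angling"),
    ("angling", "sport"),
    ("fishing", "fish"),
    ("fish", "seafood"),
    ("seafood", "food"),
    ("bread", "food"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hops(graph, start, goal):
    """Fewest relations to follow from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no chain of relations connects the two


def closest_category(graph, word, categories):
    """Return (category with the fewest hops, all distances)."""
    distances = {c: hops(graph, word, c) for c in categories}
    reachable = {c: d for c, d in distances.items() if d is not None}
    if not reachable:
        return None, distances
    return min(reachable, key=reachable.get), distances


graph = build_graph(EDGES)
print(closest_category(graph, "fishing", ["sport", "food"]))
```

With a real lexical database, the edges would come from its synonym and hypernym relations, and a word with several senses would need one search per sense.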

The star\"
#4 · 2019-03-09 03:15

First of all, you need sample text to analyze in order to learn the relationships between words. Categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.

A different approach would be naive Bayes text categorization. It needs sample texts with assigned categories. In a learning step, the program learns the different categories and the likelihood that a word occurs in a text assigned to each category (see Bayesian spam filtering). I don't know how well that works with single words.
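
A minimal sketch of that learning step, assuming a multinomial naive Bayes model with add-one (Laplace) smoothing, might look like this. The class name `NaiveBayes` and the four training sentences are invented for illustration; a real attempt would need far more labelled text per category.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (a sketch)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word counts
        self.doc_counts = Counter()              # category -> number of texts
        self.vocab = set()

    def train(self, text, category):
        tokens = text.lower().split()
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            # log P(category) + sum of log P(token | category)
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for token in tokens:
                seen = self.word_counts[category][token]
                score += math.log((seen + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = category, score
        return best


# Made-up training texts, for illustration only.
nb = NaiveBayes()
nb.train("fishing rod lake angling competition", "sport")
nb.train("football match team goal", "sport")
nb.train("bread cheese soup recipe", "food")
nb.train("grilled fish recipe dinner", "food")

print(nb.classify("fishing"))
print(nb.classify("recipe"))
```

For a single query word, the decision reduces to comparing the smoothed per-category word frequencies, which is why a word absent from the training text cannot be categorized meaningfully.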

一夜七次
#5 · 2019-03-09 03:16

I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.

Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?

Juvenile、少年°
#6 · 2019-03-09 03:17

Google is forbidden, but they have an almost perfect solution: Google Sets.

Because you need to understand the semantics of the words, you need external data sources. You could try using WordNet. Or you could try using Wikipedia: find the page for every word (or maybe only for the categories) and look for other words appearing on that page or on linked pages.

Juvenile、少年°
#7 · 2019-03-09 03:18

You might be able to use the WordNet database: create some metric to determine how closely linked two words (the word and the category) are, and then put the word in the category with the best score.
