Categorizing Words and Category Values

2019-03-09 02:29发布

We were set an algorithm problem in class today, as a "if you figure out a solution you don't have to do this subject". SO of course, we all thought we will give it a go.

Basically, we were provided a DB of 100 words and 10 categories. There is no match between either the words or the categories. So its basically a list of 100 words, and 10 categories.

We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorthmically.

i.e. one of the words is "fishing" the category "sport" --> so this would go into this category. There is some overlap between words and categories such that some words could go into more than one category.

If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.

Does anyone have ANY idea how to start something like this? Or any resources ? Preferably in C#?

Even a keyword DB or something might be helpful ? Anyone know of any free ones?

21条回答
太酷不给撩
2楼-- · 2019-03-09 02:58

So it seems you have a couple options here, but for the most part I think if you want accurate data you are going to need to use some outside help. Two options that I can think of would be to make use of a dictionary search, or crowd sourcing.

In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.

Another, slightly more outside the box option would be to make use of crowd sourcing, consider the following:

  1. Start by more or less randomly assigning name-value pairs.
  2. Output the results.
  3. Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
  4. Input the results of the AMT evaluation back into the system along with the random assignments.
  5. If everything was approved, then we are done.
  6. Otherwise, retain the correct hits and process them to see if any pattern can be established, generate a new set of name-value pairs.
  7. Return to step 3.

Granted this would entail some financial outlay, but it might also be one of the simplest and accurate versions of the data you are going get on a fairly easy basis.

查看更多
Root(大扎)
3楼-- · 2019-03-09 02:58

You could do a custom algorithm to work specifically on that data, for instance words ending in 'ing' are verbs (present participle) and could be sports.

Create a set of categorization rules like the one above and see how high an accuracy you get.

EDIT:

Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.

查看更多
Lonely孤独者°
4楼-- · 2019-03-09 02:58

Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse then to start reading existing research and benchmarks.

Appart from Google there exist other 'encyclopedic" datasets you can build of, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English language Wikipedia.

Be creative. There is other data out there besides Google.

查看更多
在下西门庆
5楼-- · 2019-03-09 02:59

My attempt would be to use the toolset of CRM114 to provide a way to analyze a big corpus of text. Then you can utilize the matchings from it to give a guess.

查看更多
Fickle 薄情
6楼-- · 2019-03-09 03:00

Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization based on such - I'm not sure if that is your intent (if it is, then do so by all means! :)

Since you say you can use external data, I would suggest using wordnet and its link between words. For instance, using wordnet,

# S: (n) **fishing**, sportfishing (the act of someone who fishes as a diversion)
* direct hypernym / inherited hypernym / sister term
      o S: (n) **outdoor sport, field sport** (a sport that is played outdoors)
      + direct hypernym / inherited hypernym / sister term
            # S: (n) **sport**, athletics 
            (an active diversion requiring physical exertion and competition) 

What we see here is a list of relationships between words. The term fishing relates to outdoor sport, which relates to sport.

Now, if you get the drift - it is possible to use this relationship to compute a probability of classifying "fishing" to "sport" - say, based on the linear distance of the word-chain, or number of occurrences, et al. (should be trivial to find resources on how to construct similarity measures using wordnet. when the prof says "not to use google", I assume he means programatically and not as a means to get information to read up on!)

As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/

查看更多
对你真心纯属浪费
7楼-- · 2019-03-09 03:05

Really poor answer (demonstrates no "understanding") - but as a crazy stab you could hit google (through code) for (for example) "+Fishing +Sport", "+Fishing +Cooking" etc (i.e. cross join each word and category) - and let the google fight win! i.e. the combination with the most "hits" gets chosen...

For example (results first):

weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado

With code (TODO: add threading ;-p):

static void Main() {
    string[] words = { "fish", "ball", "hat", "trousers", "snowball","tornado" };
    string[] categories = { "sport", "fashion", "weather" };

    using(WebClient client = new WebClient()){
        foreach(string word in words) {
            var bestCategory = categories.OrderByDescending(
                cat => Rank(client, word, cat)).First();
            Console.WriteLine("{0}: {1}", bestCategory, word);
        }
    }
}

static int Rank(WebClient client, string word, string category) {
    string s = client.DownloadString("http://www.google.com/search?q=%2B" +
        Uri.EscapeDataString(word) + "+%2B" +
        Uri.EscapeDataString(category));
    var match = Regex.Match(s, @"of about \<b\>([0-9,]+)\</b\>");
    int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
    Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
    return rank;
}
查看更多
登录 后发表回答