Is there an algorithm that tells the semantic simi

Posted 2019-01-05 07:45

input: phrase 1, phrase 2

output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing

11 answers
劫难
Reply #2 · 2019-01-05 08:24

Try SimService, which provides a service for computing top-n similar words and phrase similarity.

别忘想泡老子
Reply #3 · 2019-01-05 08:27

You might want to check out this paper:

Sentence similarity based on semantic nets and corpus statistics (PDF)

I've implemented the algorithm it describes. Our context was very general (effectively any two English sentences), and we found the approach too slow and the results, while promising, not good enough (nor likely to become so without considerable extra effort).

You don't give a lot of context, so I can't necessarily recommend this, but reading the paper could help you understand how to tackle the problem.

Regards,

Matt.

祖国的老花朵
Reply #4 · 2019-01-05 08:28

I would have a look at statistical techniques that take into account the probability of each word appearing in a sentence. This lets you give less weight to very common words such as 'and', 'or', 'the' and more weight to words that appear less regularly and are therefore better discriminators. For example, if you have two sentences:

1) The Smith-Waterman algorithm gives you a similarity measure between two strings.
2) We have reviewed the Smith-Waterman algorithm and we found it to be good enough for our project.

The fact that the two sentences share the words "Smith-Waterman" and "algorithm" (which are not as common as 'and', 'or', etc.) lets you say that the two sentences might indeed be talking about the same topic.
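To make the weighting idea concrete, here is a minimal sketch that scores two sentences by the cosine similarity of their IDF-weighted word counts. The tiny background corpus, the tokenizer, and the function names (`tokenize`, `idf_weights`, `weighted_cosine`) are assumptions made up for illustration; with a real corpus the weights would be far more meaningful.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters/digits, allowing internal hyphens.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def idf_weights(corpus):
    # Smoothed inverse document frequency: rarer words get larger weights.
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def weighted_cosine(a, b, idf, default_idf=1.0):
    # Cosine similarity between the IDF-weighted count vectors of a and b.
    va = {w: c * idf.get(w, default_idf) for w, c in Counter(tokenize(a)).items()}
    vb = {w: c * idf.get(w, default_idf) for w, c in Counter(tokenize(b)).items()}
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(x * x for x in va.values()))
    norm_b = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical background corpus: the two example sentences plus two unrelated ones.
s1 = "The Smith-Waterman algorithm gives you a similarity measure between two strings."
s2 = "We have reviewed the Smith-Waterman algorithm and we found it to be good enough for our project."
s3 = "The weather is nice and the sun is out."
s4 = "We went to the market and bought some bread."
idf = idf_weights([s1, s2, s3, s4])

print(weighted_cosine(s1, s2, idf))  # related pair
print(weighted_cosine(s1, s3, idf))  # unrelated pair, shares only "the"
```

On this toy corpus the related pair scores higher than the unrelated one, because "the" carries the lowest weight while "smith-waterman" and "algorithm" carry more. The score stays between 0 and 1, but it measures word overlap, not meaning.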

Summarizing, I would suggest you have a look at: 1) string similarity measures; 2) statistical methods.

Hope this helps.

唯我独甜
Reply #5 · 2019-01-05 08:28

Take a look at http://mkusner.github.io/publications/WMD.pdf. This paper describes an algorithm called Word Mover's Distance (WMD) that tries to capture semantic similarity. It builds on distances between word2vec word embeddings. Using it with the pretrained GoogleNews-vectors-negative300 vectors yields good results.
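As a rough illustration of the idea, the sketch below computes the relaxed variant of the distance (each word sends all of its weight to its nearest word in the other document, and the larger of the two one-sided costs is taken), which is a lower bound on the full WMD rather than the full transport problem. The 2-D vectors in `TOY_VECTORS` are invented for the example; a real setup would load pretrained word2vec embeddings instead.

```python
import math

# Hypothetical 2-D "embeddings", for illustration only.
TOY_VECTORS = {
    "president": (0.90, 0.10),
    "leader": (0.85, 0.20),
    "speaks": (0.10, 0.90),
    "greets": (0.20, 0.85),
    "banana": (-0.80, -0.60),
    "rots": (-0.60, -0.80),
}


def euclidean(u, v):
    # Straight-line distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, vectors):
    # Every source token has equal weight and moves entirely to its
    # nearest token in the destination document.
    total = sum(min(euclidean(vectors[s], vectors[d]) for d in dst) for s in src)
    return total / len(src)


def relaxed_wmd(doc_a, doc_b, vectors):
    # Larger of the two one-sided relaxations: a lower bound on full WMD.
    return max(one_sided_cost(doc_a, doc_b, vectors),
               one_sided_cost(doc_b, doc_a, vectors))


close = relaxed_wmd(["president", "speaks"], ["leader", "greets"], TOY_VECTORS)
far = relaxed_wmd(["president", "speaks"], ["banana", "rots"], TOY_VECTORS)
print(close, far)  # smaller distance means more similar
```

Note that this is a distance, not a 0-to-1 similarity. If a bounded score is needed, one simple (assumed) mapping is `1 / (1 + distance)`.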

爱情/是我丢掉的垃圾
Reply #6 · 2019-01-05 08:34

You might want to check out the WordNet project at Princeton University. One possible approach would be to first run each phrase through a stop-word list (to remove "common" words such as "a", "to", "the", etc.). Then, for each of the remaining words in each phrase, you could compute its semantic "similarity" to each of the words in the other phrase using a distance measure based on WordNet. The distance measure could be something like the number of arcs you have to pass through in WordNet to get from word1 to word2.
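Here is a minimal sketch of that arc-counting idea, run on a hand-made miniature hierarchy instead of the real WordNet. Everything in it is an assumption for illustration: the `TOY_EDGES` graph, the three-word stop list, the `1 / (1 + arcs)` mapping from distance to similarity, and the choice to average each word's best match.

```python
from collections import deque

# Hypothetical miniature is-a hierarchy standing in for WordNet.
TOY_EDGES = [
    ("dog", "canine"), ("canine", "carnivore"),
    ("cat", "feline"), ("feline", "carnivore"),
    ("carnivore", "animal"), ("animal", "entity"),
    ("car", "vehicle"), ("vehicle", "artifact"), ("artifact", "entity"),
]
STOP_WORDS = {"a", "to", "the"}


def build_graph(edges):
    # Undirected adjacency sets, so arcs can be followed up or down.
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def arc_distance(graph, start, goal):
    # Breadth-first search: number of arcs on the shortest path, or None.
    if start not in graph or goal not in graph:
        return None
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(graph, a, b):
    # Map arc count to (0, 1]; unreachable or unknown words score 0.
    dist = arc_distance(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def phrase_similarity(graph, phrase_a, phrase_b):
    # Drop stop words and unknown words, then average each word's best match.
    words_a = [w for w in phrase_a.lower().split() if w not in STOP_WORDS and w in graph]
    words_b = [w for w in phrase_b.lower().split() if w not in STOP_WORDS and w in graph]
    if not words_a or not words_b:
        return 0.0
    best = [max(path_similarity(graph, a, b) for b in words_b) for a in words_a]
    return sum(best) / len(best)


graph = build_graph(TOY_EDGES)
print(phrase_similarity(graph, "the dog", "a cat"))
print(phrase_similarity(graph, "the dog", "the car"))
```

As written, `phrase_similarity` is not symmetric (it averages over the words of the first phrase only); averaging the scores of both directions would be one way to fix that.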

Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.

叼着烟拽天下
Reply #7 · 2019-01-05 08:37

There's a short and a long answer to this.

The short answer:

Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.

The long answer:

Determining word similarity is a complicated issue, and research is still very hot in this area. To compute similarity, you need an appropriate representation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.

So, if you're interested in exploring this problem a little more in-depth, I highly recommend reading Chapter 20.7 in Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state-of-the-art of distributional methods, which use word co-occurrence statistics to define a measure for word similarity. You are not likely to find libraries implementing these, however.
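The core of the distributional idea is small enough to sketch by hand: represent each word by the counts of the words that occur near it, then compare two words by the cosine of those count vectors. The four-sentence corpus, the window size of 2, and the function names below are assumptions chosen for illustration, not anything taken from the book.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    # For every word, count the words seen within `window` positions of it.
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus; real use needs a large body of text.
corpus = [
    "she sat on the chair",
    "she sat on the stool",
    "he drank the hot coffee",
    "he drank the hot tea",
]
vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["chair"], vectors["stool"]))   # same contexts
print(cosine(vectors["chair"], vectors["coffee"]))  # only "the" in common
```

Here 'chair' and 'stool' come out as more similar than 'chair' and 'coffee' purely because they appear in the same contexts. In practice the raw counts are usually reweighted before comparing.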
