Compare two strings and find how closely they are related

Problem:
I have two strings, say, "Billie Jean" and "Thriller". I need to programmatically compare them and find how closely they are related. Both are songs by the same artist, so they should get a higher score (probability, percentage, etc.) than, say, "Brad Pitt" and "Jamaican Farewell".

One way of doing this is an open source Java tool named WikipediaMiner, which compares terms using the Wikipedia data dump, checking links, descriptions, etc.

Question:
Please suggest a better alternative that uses any or all of Wikipedia, DBpedia, Freebase and their cousins, or combines a different approach. I would really prefer open source software that can be downloaded and set up on a server (e.g. Apache Mahout), rather than a paid web service.

Tags: data-mining matching semantic-web bigdata

2 Answers

狗以群分

#2 · 2019-08-23 03:12

You can't tell that "Thriller" is a song, rather than a music video, a film genre, or a Lambchop album, without additional context.

After you've identified what your items are, it's "simply" a matter of traversing the graph of connections in Freebase, MusicBrainz, or whatever other information sources you are using.

You'll need to decide how you're going to weight things for scoring, though. Are two Michael Jackson songs more closely related because they share the same type, or are they more closely related to the artist Michael Jackson because they're directly connected to him?
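To make the weighting question concrete, here is a minimal sketch of graph-based scoring. Everything in it is an assumption for illustration: the triples are hand-written toy data (not pulled from Freebase or MusicBrainz), the `RELATION_COST` values are arbitrary, and the names `build_graph`, `path_cost` and `relatedness` are hypothetical. It treats each relation as an edge with a cost, finds the cheapest path between two already-disambiguated items, and turns that cost into a score.

```python
import heapq
from collections import defaultdict

# Hypothetical edge costs: a lower cost means stronger evidence of relatedness.
RELATION_COST = {"artist": 1.0, "type": 3.0}

# Toy, hand-written (subject, relation, object) triples; not real Freebase data.
TRIPLES = [
    ("Billie Jean", "artist", "Michael Jackson"),
    ("Thriller (song)", "artist", "Michael Jackson"),
    ("Billie Jean", "type", "Song"),
    ("Thriller (song)", "type", "Song"),
    ("Jamaican Farewell", "type", "Song"),
    ("Jamaican Farewell", "artist", "Harry Belafonte"),
    ("Michael Jackson", "type", "Person"),
    ("Harry Belafonte", "type", "Person"),
    ("Brad Pitt", "type", "Person"),
]


def build_graph(triples):
    """Build an undirected adjacency list with one cost per relation."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        cost = RELATION_COST[rel]
        graph[subj].append((obj, cost))
        graph[obj].append((subj, cost))
    return graph


def path_cost(graph, start, goal):
    """Cheapest path cost (Dijkstra); infinity if the items are not connected."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, step in graph[node]:
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return float("inf")


def relatedness(graph, a, b):
    """Map a path cost to a score in [0, 1]; 1.0 means identical items."""
    return 1.0 / (1.0 + path_cost(graph, a, b))


GRAPH = build_graph(TRIPLES)
print(relatedness(GRAPH, "Billie Jean", "Thriller (song)"))
print(relatedness(GRAPH, "Brad Pitt", "Jamaican Farewell"))
```

With these particular costs the two songs connect through their shared artist rather than through the shared type, and each song scores higher against Michael Jackson himself than against the other song. Changing `RELATION_COST` changes that ordering, which is exactly the decision described above.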


太酷不给撩

#3 · 2019-08-23 03:20

It's not so much a matter of programming, but of data.

So it's not really a question for StackOverflow.

What you really want is to use WordNet, I guess. It is meant as a database for reasoning about the meaning of words. For example, the data explicitly states that data mining is a form of data processing. And that, in turn, is a physical entity...

You see, the reasoning will be only as good as your data is.

DBpedia may also include a mapping from WordNet to Wikipedia, though I'm not sure.
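As a rough illustration of the "is a form of" reasoning mentioned above, here is a small sketch of path-based similarity over a hypernym hierarchy. The `HYPERNYM` table is a hand-written toy, not copied from WordNet, and the function names are hypothetical. A real setup would load the hierarchy from WordNet itself, for instance through NLTK's WordNet corpus reader, instead of this dictionary.

```python
# Toy "is a form of" hierarchy, hand-written for illustration (not WordNet data).
HYPERNYM = {
    "data mining": "data processing",
    "data processing": "processing",
    "text processing": "processing",
    "processing": "entity",
    "song": "music",
    "music": "entity",
}


def hypernym_chain(term):
    """Return the term followed by its ancestors, up to the root."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_similarity(a, b):
    """1 / (1 + number of edges between a and b via their closest common ancestor)."""
    steps_from_a = {node: i for i, node in enumerate(hypernym_chain(a))}
    for j, node in enumerate(hypernym_chain(b)):
        if node in steps_from_a:
            return 1.0 / (1.0 + steps_from_a[node] + j)
    return 0.0  # no common ancestor in the data


print(path_similarity("data mining", "data processing"))
print(path_similarity("data mining", "song"))
```

A term that is missing from the table scores 0.0 against everything else, which is the point made above: the reasoning is only as good as the data behind it.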
