Building or Finding a “relevant terms” suggestion

2019-03-09 18:58发布

问题:

Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.

For example, submitting "baseball" would return

["shortstop", "Babe Ruth", "foul ball", "steroids", ... ]

Google Sets is the best example I can find of this kind of feature, but I can't use it since they have no public API (and I wont go against their TOS). Also, single-word input doesn't garner a very diverse set of results. I'm looking for a solution that goes off on tangents.

The closest I've experimented with is using WikiPedia's API to search Categories and Backlinks, but there's no way to directly sort those results by "relevance" or "popularity". Without that, the suggestion list is massive and all over the place, which is not immediately useful and very hard to whittle down.

Using A Thesaurus could also work minimally, but that would leave out any proper nouns or tangentially relevant terms (like any of the results listed above).


I would happily reuse an open service, if one exists, but I haven't found anything sufficient.

I'm looking for either a way to implement this either in-house with a decently-populated starting set, or reuse a free service that offers this.

Have a solution? Thanks ahead of time!


UPDATE: Thank you for the incredibly dense & informative answers. I'll choose a winning answer in 6 to 12 months, when I'll hopefully understand what you've all suggested =)

回答1:

You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.



回答2:

Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small data set.

You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.

If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.

The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.

It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.



回答3:

Take a look at the following two papers:

  • Clustering User Queries of a Search Engine [pdf]
  • Topic Detection by Clustering Keywords [pdf]
  • Here is my attempt at a very simplified explanation:

    If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".

    We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or that are too obscure.

    If our data permits, then we could see which queries led users to choosing which results, instead of comparing documents by content. For example if we had data that showed us that users searching for "Celtics" and "Lakers" both ended up clicking on espn.com, then we could call these related terms.

    If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.