How to group / compare similar news articles

Posted 2019-03-13 10:54

Question:

In an app that I'm creating, I want to add functionality that groups news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and one from MSNBC would be in the same group. I am guessing it's some sort of fuzzy logic comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use.

Thanks in advance for the help!

Answer 1:

This problem breaks down into a few subproblems from a machine learning standpoint.

First, you are going to want to figure out which properties of the news stories you want to group on. A common technique is to use a "bag of words": just a list of the words that appear in the body of the story or in the title. You can do some additional processing, such as removing common English "stop words" that carry no meaning, such as "the" and "because". You can even apply Porter stemming to remove redundancies caused by plurals and word endings such as "-ion". This list of words is the feature vector of each document and will be used to measure similarity. You may have to do some preprocessing to remove HTML markup.
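As a rough illustration of the feature-extraction step, here is a minimal sketch in plain Python. The stop-word list is a tiny made-up sample, the regex-based tag stripping is deliberately crude, and stemming is left out; the function names are hypothetical, not from any library.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
              "because", "is", "are", "was", "for", "with"}

def strip_html(text: str) -> str:
    """Crudely replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def bag_of_words(text: str) -> Counter:
    """Count the words in a story, ignoring case, markup, and stop words."""
    words = re.findall(r"[a-z0-9']+", strip_html(text).lower())
    return Counter(w for w in words if w not in STOP_WORDS)

bag = bag_of_words("<p>The senate passed the budget bill because of the deadline.</p>")
print(bag)
```

The resulting counter plays the role of the feature vector; a stemmer from an NLP library could be applied to each word before counting.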

Second, you have to define a similarity metric: similar stories score high in similarity. Going along with the bag of words approach, two stories are similar if they have similar words in them (I'm being vague here, because there are tons of things you can try, and you'll have to see which works best).
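One of the many possible metrics is Jaccard similarity, the share of distinct words two stories have in common. The sketch below is only one candidate among the "tons of things you can try", and the sample word sets are invented for illustration.

```python
def jaccard_similarity(words_a: set, words_b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 to 1.0)."""
    if not words_a and not words_b:
        return 0.0  # convention chosen here for two empty stories
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical word sets, already lowercased and stripped of stop words.
story_a = {"senate", "passes", "budget", "bill"}
story_b = {"senate", "approves", "budget", "bill"}
story_c = {"team", "wins", "final"}

same_topic = jaccard_similarity(story_a, story_b)   # 3 shared / 5 distinct
other_topic = jaccard_similarity(story_a, story_c)  # nothing shared
print(same_topic, other_topic)
```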

Finally, you can use a classic clustering algorithm, such as k-means clustering, which groups the stories together based on the similarity metric.
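To make the clustering step concrete, here is a small k-means sketch in plain Python over word-count vectors. It is a simplification under stated assumptions: it uses Euclidean distance, a fixed number of iterations, and a deterministic farthest-point seeding instead of the usual random initialization, and the four sample headlines are invented. In practice a machine learning library would normally do this for you.

```python
import math
from collections import Counter

def vectorize(docs):
    """Turn each document into a dense word-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(vectors, k, iterations=10):
    # Seeding (assumption): start from the first vector, then repeatedly add
    # the vector that is farthest from all centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(euclidean(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "senate passes budget bill",
    "budget bill passes senate vote",
    "team wins championship final",
    "championship final won by home team",
]
labels = k_means(vectorize(docs), k=2)
print(labels)
```

Note that k-means needs the number of groups `k` up front, which is awkward for news, where the number of topics is not known in advance.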

In summary: convert each news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.

Check out Google Scholar; there have probably been some papers on this specific topic in the recent literature. A lot of the things I just discussed are implemented in natural language processing and machine learning modules for most major languages.



Answer 2:

The problem can be broken down to:

  • How to represent articles (features, usually a bag of words with TF-IDF)
  • How to calculate similarity between two articles (cosine similarity is the most popular)
  • How to cluster articles together based on the above
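The first two points can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: TF-IDF has several common variants, and the smoothed IDF formula used here is just one of them. The function names and sample headlines are made up for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {word: weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # term frequency * smoothed inverse document frequency (one variant)
            w: (c / len(tokens)) * (math.log((1 + n) / (1 + doc_freq[w])) + 1)
            for w, c in counts.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "senate passes budget bill",
    "budget bill clears senate",
    "home team wins championship final",
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # same story, different source
print(cosine_similarity(vectors[0], vectors[2]))  # unrelated story
```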

There are two broad groups of clustering algorithms: batch and incremental. Batch is great if you've got all your articles ahead of time. Since you're clustering news, you've probably got your articles coming in incrementally, so you can't cluster them all at once. You'll need an incremental (aka sequential) algorithm, and these tend to be complicated.
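One of the simplest incremental schemes is a single-pass approach: compare each incoming article with the existing clusters, join the most similar one if it clears a threshold, and otherwise start a new cluster. The sketch below assumes raw word counts, cosine similarity, and an arbitrary threshold of 0.3; the class name and article ids are hypothetical, and real systems are considerably more involved.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[w] for w, count in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class IncrementalClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold  # arbitrary value, needs tuning on real data
        self.centroids = []         # summed word counts per cluster
        self.members = []           # article ids per cluster

    def add(self, article_id, text):
        """Assign one incoming article to a cluster and return the cluster index."""
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.centroids[best].update(vec)  # fold the article into the cluster
            self.members[best].append(article_id)
            return best
        # Nothing similar enough: the article starts its own cluster.
        self.centroids.append(vec)
        self.members.append([article_id])
        return len(self.centroids) - 1

clusterer = IncrementalClusterer()
clusterer.add("cnn-1", "senate passes budget bill")
clusterer.add("msnbc-1", "budget bill clears senate vote")
clusterer.add("espn-1", "home team wins championship final")
clusterer.add("bbc-1", "team wins final in overtime")
print(clusterer.members)
```

The result of a single-pass scheme like this depends on the order in which articles arrive, which is part of why more robust incremental algorithms get complicated.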

You can also try http://www.similetrix.com. A quick Google search turned them up, and they claim to offer this service via an API.



Answer 3:

One approach would be to add tags to the articles when they are listed. One tag would be XYZ. Other tags might describe the article subject.

You can do that in a database. You can have an unlimited number of tags for each article. Then, the "groups" could be identified by one or more tags.

This approach is heavily dependent upon human beings assigning appropriate tags, so that the right articles are returned from the search, but not too many articles. It isn't easy to do really well.
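For the database side of the tagging approach, a minimal sketch with Python's built-in `sqlite3` module might look like this. The table names, column names, sample rows, and the `articles_with_tags` helper are all hypothetical; any relational database with a many-to-many tag table would work the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, source TEXT, title TEXT);
-- One row per (article, tag) pair, so an article can carry any number of tags.
CREATE TABLE article_tags (
    article_id INTEGER REFERENCES articles(id),
    tag TEXT,
    PRIMARY KEY (article_id, tag)
);
""")
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", [
    (1, "CNN", "XYZ announces merger"),
    (2, "MSNBC", "Merger planned by XYZ"),
    (3, "CNN", "Local team wins final"),
])
conn.executemany("INSERT INTO article_tags VALUES (?, ?)", [
    (1, "XYZ"), (1, "business"), (2, "XYZ"), (3, "sports"),
])

def articles_with_tags(conn, tags):
    """Return (source, title) for articles that carry every tag in `tags`."""
    placeholders = ",".join("?" for _ in tags)
    return conn.execute(
        f"""SELECT a.source, a.title
            FROM articles a
            JOIN article_tags t ON t.article_id = a.id
            WHERE t.tag IN ({placeholders})
            GROUP BY a.id
            HAVING COUNT(DISTINCT t.tag) = ?
            ORDER BY a.id""",
        (*tags, len(tags)),
    ).fetchall()

xyz_group = articles_with_tags(conn, ["XYZ"])
print(xyz_group)
```

Requiring several tags at once, for example `["XYZ", "business"]`, narrows the group, which is one way to keep a search from returning too many articles.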