So, I've been mulling over these concepts for some time, and my understanding is very basic. Information retrieval seems to be a topic seldom covered in the wild...
My questions stem from the process of clustering documents. Say I start with a collection of documents containing only interesting words. What is the first step here? Parse the words from each document and create a giant 'bag-of-words' model? Do I then proceed to create a vector of word counts for each document? And how do I compare these documents using something like K-means clustering?
Try tf-idf for starters: instead of raw word counts, it weights each word by how rare it is across the whole collection, so ubiquitous words don't dominate the comparison between documents.
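For concreteness, here is a minimal sketch of the vectorization step with scikit-learn's TfidfVectorizer (the toy corpus is made up, just to show the shapes):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus -- substitute your own documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Rows are documents, columns are words; entries are tf-idf weights,
# so words that appear in every document are down-weighted.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(X.shape)                            # (3, number_of_distinct_terms)
print(vectorizer.get_feature_names_out()) # which column is which word
```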
If you read Python, look at the "Clustering text documents using MiniBatchKMeans" example in scikit-learn:
"an example showing how the scikit-learn can be used to cluster documents by topics using a bag-of-words approach".
Then feature_extraction/text.py in the source has very nice classes.
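Putting the pieces together, a rough end-to-end sketch could look like the following (the corpus is again a toy one, and n_clusters=2 is an arbitrary choice, not a tuned value):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
    "stocks fell sharply on Monday",
    "the market rallied after the report",
]

# Bag-of-words step: one tf-idf vector per document.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)   # sparse matrix, documents x terms

# Cluster the vectors; MiniBatchKMeans accepts sparse input directly.
km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=10)
labels = km.fit_predict(X)

for label, doc in zip(labels, docs):
    print(label, doc)
```

Documents assigned the same label end up in the same cluster; from there you can inspect km.cluster_centers_ to see which terms carry the most weight in each cluster.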