Question:

I have written Python code to implement the DBSCAN clustering algorithm. My dataset consists of 14k users, with each user represented by 10 features. I am unable to decide what values to use for min_samples and epsilon as input. How should I decide? The similarity measure is Euclidean distance (which makes it even harder to decide). Any pointers?

Answer 1:
DBSCAN's parameters are often hard to estimate.
Have you considered the OPTICS algorithm? In that case you only need min_samples, which corresponds to the minimum cluster size.
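If you want to try that route, here is a minimal sketch using scikit-learn's OPTICS implementation; the random `X` and `min_samples=20` are placeholder assumptions for your 14k-by-10 data, not recommended values:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Placeholder for your 14k users x 10 features matrix
X = np.random.rand(14000, 10)

# min_samples is the only required density parameter; it roughly
# corresponds to the smallest cluster size you care about.
optics = OPTICS(min_samples=20, metric="euclidean")
labels = optics.fit_predict(X)  # label -1 marks noise points

print(f"clusters: {labels.max() + 1}, noise points: {(labels == -1).sum()}")
```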
Otherwise, for DBSCAN I've done it in the past by trial and error: try some values and see what happens. A general rule is that the noisier your dataset, the larger min_samples should be, and it should also grow with the number of dimensions (10 in this case).
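To make the trial and error for epsilon less blind, one widely used heuristic (my addition, not part of the answer above) is the k-distance plot: compute each point's distance to its k-th nearest neighbor with k = min_samples, sort those distances, and read epsilon off the "elbow" where the curve bends sharply. A sketch, again with a placeholder `X` and an assumed min_samples:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X = np.random.rand(14000, 10)  # placeholder for your 14k x 10 data
min_samples = 20               # assumed; scale it with noise and dimensionality

# Distance from each point to its min_samples-th nearest neighbor
nbrs = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nbrs.kneighbors(X)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by k-NN distance")
plt.ylabel(f"distance to {min_samples}-th neighbor")
plt.show()  # epsilon ~= the y-value at the elbow of this curve

# Then plug the elbow value in (0.5 here is just a stand-in)
db = DBSCAN(eps=0.5, min_samples=min_samples, metric="euclidean").fit(X)
```

If the curve has no visible elbow, that itself is a hint that a single global epsilon may not fit your data well, which is another argument for OPTICS.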