Number of buckets in LSH

Posted 2019-07-13 00:55

In LSH, you hash slices of the documents into buckets. The idea is that documents that fall into the same bucket are potentially similar, and thus possible nearest neighbors.
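To make the "slices into buckets" step concrete, here is a minimal sketch of the banding scheme often used with MinHash signatures. It assumes each document is already represented by a fixed-length integer signature; the names `band_buckets` and `candidate_pairs` and the toy signatures are made up for illustration and do not come from any particular library.

```python
from collections import defaultdict
from itertools import combinations


def band_buckets(signatures, bands, rows, num_buckets):
    """Hash each band (slice) of every signature into one of num_buckets buckets.

    signatures: dict mapping doc id -> list of ints of length bands * rows.
    Returns one bucket table per band: bucket index -> list of doc ids.
    """
    tables = [defaultdict(list) for _ in range(bands)]
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band_slice = tuple(sig[b * rows:(b + 1) * rows])
            bucket = hash(band_slice) % num_buckets
            tables[b][bucket].append(doc_id)
    return tables


def candidate_pairs(tables):
    """Documents sharing a bucket in at least one band become candidate pairs."""
    pairs = set()
    for table in tables:
        for docs in table.values():
            for x, y in combinations(sorted(docs), 2):
                pairs.add((x, y))
    return pairs


# Toy signatures (hypothetical data): "a" and "b" agree on the first band.
signatures = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 9, 9],
    "c": [7, 7, 7, 7],
}
tables = band_buckets(signatures, bands=2, rows=2, num_buckets=1000)
pairs = candidate_pairs(tables)
```

With fewer buckets, unrelated slices collide more often, so the candidate set grows; with more buckets, the tables get sparser. That trade-off is what the question about the bucket count comes down to.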

For 40,000 documents, what is a reasonably good value for the number of buckets?

I currently have number_of_buckets = 40000/4, but I feel it could be reduced further.

Any ideas, please?

Related: How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

Tags: hash document nearest-neighbor locality-sensitive-hash bigdata

1 answer

手持菜刀，她持情操

#2 · 2019-07-13 01:14

A common starting point is to use sqrt(n) buckets for n documents; for 40,000 documents that is 200 buckets. You can try doubling and halving that and then analyze how the documents end up distributed across the buckets. Naturally, any other exponent can be tried as well, or even K * log(n) if you expect the number of distinct clusters to grow slowly.
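A rough sketch of that experiment, under some assumptions: `candidate_bucket_counts` and `occupancy_stats` are hypothetical helper names, the constant `k=100` for the K * log(n) variant is an arbitrary placeholder (the right K depends on your data), the logarithm is taken as natural, and the integer keys stand in for whatever band hashes your LSH produces.

```python
import math
from collections import Counter


def candidate_bucket_counts(n, k=100):
    """Bucket counts to try for n documents: sqrt(n), half, double, k*log(n)."""
    root = round(math.sqrt(n))
    return {
        "sqrt(n)/2": max(1, root // 2),
        "sqrt(n)": root,
        "2*sqrt(n)": 2 * root,
        "k*log(n)": max(1, round(k * math.log(n))),  # k is an assumed constant
    }


def occupancy_stats(keys, num_buckets):
    """Summarize how hashable keys spread over num_buckets buckets."""
    keys = list(keys)
    sizes = Counter(hash(key) % num_buckets for key in keys)
    return {
        "non_empty": len(sizes),
        "largest": max(sizes.values()),
        "mean_size": len(keys) / len(sizes),
    }


n = 40000
counts = candidate_bucket_counts(n)
# Stand-in keys: replace range(n) with the real band hashes of your documents.
report = {name: occupancy_stats(range(n), m) for name, m in counts.items()}
for name, stats in report.items():
    print(name, counts[name], stats)
```

On real data, a very large "largest" bucket relative to "mean_size" suggests too few buckets (or a skewed hash), while many near-empty buckets suggest the count can be reduced.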

I don't think this is an exact science yet; it is a similar problem to choosing the optimal k for k-means clustering.
