Number of buckets in LSH

2019-07-13 01:12发布

站内文章 / 后端开发

75 0

我命由我不由天

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

In LSH, you hash slices of the documents into buckets. The idea is that these documents that fell into the same buckets will be potentially similar, thus a nearest neighbor, possibly.

For 40.000 documents, what is a good value (pretty much) for the number of buckets?

I have it as: number_of_buckets = 40.000/4 now, but I feel it can be reduced more.

Any ideas, please?

Relative: How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

回答1:

A common starting point is to use sqrt(n) buckets for n documents. You can try doubling and halving that and run some analysis to see what kind of document distributions you got. Naturally any other exponent can be tried as well, and even K * log(n) if you expect that the number of distinct clusters grows "slowly".

I don't think this is an exact science yet, belongs on the similar topic as choosing the optimal k for k-means clustering.