3D clustering Algorithm

2019-03-10 08:55发布

问题:

Problem Statement: I have the following problem:

There are more than a billion points in 3D space. The goal is to find the top N points which has largest number of neighbors within given distance R. Another condition is that the distance between any two points of those top N points must be greater than R. The distribution of those points are not uniform. It is very common that certain regions of the space contain a lot of points.

Goal: To find an algorithm that can scale well to many processors and has a small memory requirement.

Thoughts: Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. irregular spatial decomposition that evenly divide the number of points may help us the problem. I will really appreciate that if someone can shed some lights on how to solve this problem.

回答1:

Use an Octree. For 3D data with a limited value domain that scales very well to huge data sets.

Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.

Splitting at each level into 8 bins (2^d for d=3) works very well. And since you can stop when there are too few points in a cell, and build a deeper tree where there are a lot of points that should fit your requirements quite well.

For more details, see Wikipedia:

https://en.wikipedia.org/wiki/Octree

Alternatively, you could try to build an R-tree. But the R-tree tries to balance, making it harder to find the most dense areas. For your particular task, this drawback of the Octree is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found at approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!



回答2:

I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.

I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.

Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).

If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.

Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.



回答3:

Here are some possible parts of a solution. There are various choices at each stage, which will depend on Ncluster, on how fast the data changes, and on what you want to do with the means.

3 steps: quantize, box, K-means.

1) quantize: reduce the input XYZ coordinates to say 8 bits each, by taking 2^8 percentiles of X,Y,Z separately. This will speed up the whole flow without much loss of detail. You could sort all 1G points, or just a random 1M, to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256 with 2^(30-8) points in each range. To map float X -> 8 bit x, unrolled binary search is fast — see Bentley, Pearls p. 95.

Added: Kd trees split any point cloud into different-sized boxes, each with ~ Leafsize points — much better than splitting X Y Z as above. But afaik you'd have to roll your own Kd tree code to split only the first say 16M boxes, and keep counts only, not the points.

2) box: count the number of points in each 3d box, [xj .. xj+1, yj .. yj+1, zj .. zj+1]. The average box will have 2^(30-3*8) points; the distribution will depend on how clumpy the data is. If some boxes are too big or get too many points, you could a) split them into 8, b) track the centre of the points in each box, otherwide just take box midpoints.

3) K-means clustering on the 2^(3*8) box centres. (Google parallel "k means" -> 121k hits.) This depends strongly on K aka Ncluster, also on your radius R. A rough approach would be to grow a heap of the say 27*Ncluster boxes with the most points, then take the biggest ones subject to your Radius constraint. (I like to start with a Minimum spanning tree, then remove the K-1 longest links to get K clusters.) See also Color quantization .

I'd make Nbit, here 8, a parameter from the beginning.

What is your Ncluster ?

Added: if your points are moving in time, see collision-detection-of-huge-number-of-circles on SO.



回答4:

I would also suggest to use an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (aka 3D box). After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, the OctoMap is very easy to adopt.

Here you can see how it was extended to model the point distribution using a planar model.



回答5:

Just an idea. Create a graph with given points and edges between points when distance < R.

Creation of this kind of graph is similar to spatial decomposition. Your questions can be answered with local search in graph. First are vertices with max degree, second is finding of maximal unconnected set of max degree vertices.

I think creation of graph and search can be made parallel. This approach can have large memory requirement. Splitting domain and working with graphs for smaller volumes can reduce memory need.