I have a huge set of N-dimensional points (tens of millions; N is close to 100).
I need to map these points to a single dimension while preserving spatial locality. I want to use a Hilbert space-filling curve to do it.
For each point, I want to pick the closest point on the curve. The Hilbert value of the point (the curve length from the start of the curve to the picked point) is the single-dimension value I seek.
Computation does not have to be instant, but I expect it to take no more than several hours on decent modern home PC hardware.
Any suggestions on implementation? Are there any libraries that would help me? (Language does not matter much.)
I finally broke down and shelled out some money. The AIP (American Institute of Physics) has a nice, short article with source code in C: "Programming the Hilbert curve" by John Skilling (AIP Conf. Proc. 707, 381 (2004)). Its appendix has code for mappings in both directions. It works for any number of dimensions > 1, is not recursive, does not use state-transition lookup tables that gobble up huge amounts of memory, and mostly uses bit operations. Thus it is reasonably fast and has a small memory footprint.
If you choose to purchase the article, be aware that I discovered an error in the source code.
The following line (found in the function TransposetoAxes) contains the bug:
for( i = n-1; i >= 0; i-- ) X[i] ^= X[i-1];
The correction is to change the greater than or equal (>=) to a greater than (>). Without this correction, the X array is accessed using a negative index when the variable “i” becomes zero, causing the program to fail.
I recommend reading the article (which is seven pages long, including the code), as it explains how the algorithm works, which is far from obvious.
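For reference, here is my reconstruction in C of the two mappings, with the correction above applied. Treat it as a sketch of Skilling's approach rather than the verbatim published code; `b` is the number of bits per dimension and `n` the number of dimensions:

```c
typedef unsigned int coord_t;   /* one word per dimension; b bits of each are used */

/* Transpose-form Hilbert index -> axis coordinates, in place.
   X holds the "transposed" index: the bits of the Hilbert integer,
   de-interleaved across the n words. */
void TransposetoAxes(coord_t X[], int b, int n) {
    coord_t N = 2 << (b - 1), P, Q, t;
    int i;
    /* Gray decode by H ^ (H/2) */
    t = X[n - 1] >> 1;
    for (i = n - 1; i > 0; i--)   /* i > 0, NOT i >= 0: this is the corrected line */
        X[i] ^= X[i - 1];
    X[0] ^= t;
    /* Undo excess work */
    for (Q = 2; Q != N; Q <<= 1) {
        P = Q - 1;
        for (i = n - 1; i >= 0; i--)
            if (X[i] & Q) X[0] ^= P;                              /* invert   */
            else { t = (X[0] ^ X[i]) & P; X[0] ^= t; X[i] ^= t; } /* exchange */
    }
}

/* Axis coordinates -> transpose-form Hilbert index, in place (inverse of above). */
void AxestoTranspose(coord_t X[], int b, int n) {
    coord_t M = 1 << (b - 1), P, Q, t;
    int i;
    /* Inverse undo */
    for (Q = M; Q > 1; Q >>= 1) {
        P = Q - 1;
        for (i = 0; i < n; i++)
            if (X[i] & Q) X[0] ^= P;                              /* invert   */
            else { t = (X[0] ^ X[i]) & P; X[0] ^= t; X[i] ^= t; } /* exchange */
    }
    /* Gray encode */
    for (i = 1; i < n; i++) X[i] ^= X[i - 1];
    t = 0;
    for (Q = M; Q > 1; Q >>= 1)
        if (X[n - 1] & Q) t ^= Q - 1;
    for (i = 0; i < n; i++) X[i] ^= t;
}
```

A quick sanity check is that the two functions are exact inverses, so encoding and then decoding must return the original coordinates.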
I translated his code into C# for my own use. The code follows. Skilling performs the transformation in place, overwriting the vector that you pass in; I chose instead to clone the input and return a new vector. I also implemented the methods as extension methods.
Skilling's code represents the Hilbert index as a transpose, stored as an array. I find it more convenient to interleave the bits into a single BigInteger (more useful in Dictionaries, easier to iterate over in loops, etc.). I optimized that operation and its inverse with magic numbers, bit operations and the like, and the code is lengthy, so I have omitted it here.
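For anyone who wants the simple version of that interleaving, an unoptimized sketch (in C rather than C#, and not the optimized code mentioned above) could look like this, assuming `b * n` fits in 64 bits:

```c
#include <stdint.h>

/* Interleave the transpose-form index X (n words, b significant bits each)
   into a single integer, most significant bit of each word first.
   Requires b * n <= 64. */
uint64_t interleave(const uint32_t X[], int b, int n) {
    uint64_t h = 0;
    for (int bit = b - 1; bit >= 0; bit--)       /* high bits first */
        for (int i = 0; i < n; i++)
            h = (h << 1) | ((X[i] >> bit) & 1);
    return h;
}
```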
I have posted working code in C# to github.
See https://github.com/paulchernoch/HilbertTransformation
UPDATE: I just published (Fall 2019) a Rust crate on crates.io called "hilbert". It also uses Skilling's algorithm. See https://crates.io/crates/hilbert
It's not clear to me how this will do what you want. Consider this trivial 3D case:
which can be "Hilbertized" by the following path:
into the 1D order:
Here's the nasty bit. Consider the list of pairs and 1D distances below:
In all cases, the left- and right-hand values are the same 3D distance from each other (±1 in the first position), which seems to imply similar "spatial locality". But linearizing by any choice of dimensional ordering (x, then y, then z, in the above example) breaks that locality.
Another way of saying this: take a starting point and order the remaining points by their distance from it; the resulting 1D ordering and 3D ordering will differ significantly. Taking
000
as the start, for example. This effect grows exponentially with the number of dimensions (assuming that each dimension has the same "size").
I spent a little time translating Paul Chernoch's code to Java and cleaning it up. It's possible that there is a bug in my code, especially because I don't have access to the paper that it's originally from. However, it passes what unit tests I was able to write. It is below.
Note that I have evaluated both Z-Order and Hilbert curves for spatial indexing on largish data sets. I have to say that Z-Order provides far better quality. But feel free to try for yourself.
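For reference, a 2-D Z-order (Morton) key is simply the coordinates with their bits interleaved, which can be computed branch-free with the usual magic-number bit-spreading trick (helper names here are my own):

```c
#include <stdint.h>

/* Spread the 16 low bits of v so a zero bit sits between each original bit. */
static uint32_t spread16(uint32_t v) {
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* 2-D Morton (Z-order) key: bits of x and y interleaved, y in the odd bits. */
uint32_t morton2(uint16_t x, uint16_t y) {
    return spread16(x) | (spread16(y) << 1);
}
```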
I don't see how you can use a Hilbert curve in one dimension. If you are interested in mapping points to a lower dimension while preserving distances (with minimum error), then you can look into "multidimensional scaling" algorithms.
Simulated annealing is one approach.
Edit: Thanks for the comment. I see what you meant by the Hilbert Curve approach now. However, this is a hard problem, and given N=100 and 10 million data points I don't think any approach will preserve locality well and run in a reasonable amount of time. I don't think kd-trees will work here.
If finding a total ordering is not important to you, then you can look into locality-sensitive hashing and other approximate nearest neighbor schemes. Hierarchical multidimensional scaling with buckets of points to reduce the input size might give you a good ordering, but again it is doubtful at such high dimensionality.
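To illustrate the locality-sensitive-hashing idea, here is a random-hyperplane (SimHash-style) sketch in C. The function name and the use of a seeded `rand()` to generate hyperplanes are my own simplifications, not a reference implementation:

```c
#include <stdlib.h>
#include <stdint.h>

/* Hash a point to nbits sign bits of dot products with random hyperplanes.
   Points whose hashes share many bits are likely close in angle.
   rand() is used only for brevity; a real implementation would use a
   better generator and precompute the hyperplanes once. */
uint32_t simhash(const double *x, int dim, int nbits, unsigned seed) {
    srand(seed);                 /* same seed => same hyperplanes for every point */
    uint32_t h = 0;
    for (int b = 0; b < nbits; b++) {
        double dot = 0.0;
        for (int i = 0; i < dim; i++) {
            double r = (double)rand() / RAND_MAX * 2.0 - 1.0;  /* component in [-1,1] */
            dot += r * x[i];
        }
        h = (h << 1) | (dot >= 0.0);
    }
    return h;
}
```

Note that a positive rescaling of a point preserves the sign of every dot product, so the hash depends only on direction, not magnitude.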
Another possibility would be to build a kd-tree on your data, and then do an in-order traversal of the tree to get the ordering. Constructing the kd-tree only requires a good median-finding algorithm, of which there are many.
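A compact way to sketch that in C: recursively sorting each half around the median on a cycling axis leaves the array in exactly the kd-tree's in-order sequence, so the final array order is the 1-D ordering. `qsort` stands in for a proper linear-time median select here, and all names are my own:

```c
#include <stdlib.h>

#define DIM 2
typedef struct { double c[DIM]; } Point;

static int cmp_axis;   /* axis for the comparator (single-threaded sketch only) */

static int cmp(const void *a, const void *b) {
    double d = ((const Point *)a)->c[cmp_axis] - ((const Point *)b)->c[cmp_axis];
    return (d > 0) - (d < 0);
}

/* Recursively place the median on a cycling axis and recurse into each half.
   Afterwards the array is in kd-tree in-order, usable directly as a 1-D ordering. */
void kd_order(Point *pts, int n, int axis) {
    if (n <= 1) return;
    cmp_axis = axis;
    qsort(pts, n, sizeof(Point), cmp);   /* O(n log n); median select would suffice */
    int mid = n / 2;
    kd_order(pts, mid, (axis + 1) % DIM);
    kd_order(pts + mid + 1, n - mid - 1, (axis + 1) % DIM);
}
```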
An algorithm for mapping both n→1 and 1→n is given in "Calculation of Mappings Between One and n-dimensional Values Using the Hilbert Space-filling Curve" by J. K. Lawder.
If you Google for "SFC module and Kademlia overlay", you'll find a group who claim to use it as part of their system. If you view the source, you can probably extract the relevant function.