Order-Preserving Hash Function

Posted 2019-06-18 14:28

Is there any hash function that produces unique hash codes (like MD5) and is also order-preserving?

NOTE: I don't care about security; I need it for sorting. I have a lot of chunks (~1 MB each) that I want to sort. Of course I can use an index sort, but I want to reduce the time spent on comparisons.

Theoretically: if I have 1'000'000 chunks of 1 MB each (1'048'576 bytes) and they all differ only in the last 10 bytes, then comparing one chunk to another costs O(k-10), where k is the chunk size. If I use quicksort (which makes ~n*log2(n) comparisons), the total comparison cost is n*log2(n)*(k-10) = 1'000'000 * 20 * (1'048'576 - 10), roughly 2.1 * 10^13 byte comparisons.

That's why I want to generate fixed-size order-preserving hash codes (for example, 16 bytes) once, then sort the chunks and save the result (for example, to a file).

5 answers
手持菜刀,她持情操
Reply #2 · 2019-06-18 14:37

According to NIST (I'm no expert) a Pearson hash can be order-preserving. The hash uses an auxiliary table. Such a table can (in theory) be constructed such that the resulting hash is order preserving.
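For readers unfamiliar with the mechanics, here is a minimal sketch of an 8-bit Pearson hash with its auxiliary table. The function names and the seeded shuffle are assumptions made for illustration only; a randomly shuffled table like this one is not order-preserving, and building a table that is would need extra work tailored to the key set.

```python
import random


def make_table(seed: int = 0) -> list[int]:
    """Build a 256-entry permutation table (hypothetical helper, seeded shuffle)."""
    table = list(range(256))
    random.Random(seed).shuffle(table)
    return table


def pearson_hash(data: bytes, table: list[int]) -> int:
    """Classic 8-bit Pearson hash: one table lookup per input byte."""
    h = 0
    for byte in data:
        h = table[h ^ byte]
    return h


table = make_table()
digest = pearson_hash(b"chunk contents", table)  # always in the range 0..255
```

The output is a single byte, so it cannot distinguish more than 256 keys, which is consistent with the size caveat in this answer.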

It doesn't meet your full requirements though, because it doesn't reduce the size as you would like. I'm posting this in case other people are looking for a solution.

Some pointers:

smile是对你的礼貌
Reply #3 · 2019-06-18 14:40

CHM (Z.J. Czech, G. Havas, and B.S. Majewski) is an algorithm that generates a minimal perfect hash function which preserves ordering (i.e. if A < B, then h(A) < h(B)). It uses approximately 8 bytes of storage per key.

See: http://cmph.sourceforge.net/chm.html

倾城 Initia
Reply #4 · 2019-06-18 14:40

In the general case, such a function is impossible unless the size of the hash is at least the size of the object.

The argument is trivial: if there are N objects but only M < N hash values, then by the pigeonhole principle two different objects are mapped to the same hash value, and so their order is not preserved.

If, however, the objects are guaranteed to have additional properties, or the requirements are relaxed, a custom or probabilistic solution may become possible.
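One way to relax the requirements is to accept collisions and keep only weak monotonicity: if a <= b, then key(a) <= key(b). The sketch below uses a fixed-size prefix as such a key and falls back to the full comparison on ties. The names `prefix_key` and `sort_chunks` and the 16-byte size are assumptions for illustration, and a prefix would not help in the question's worst case, where chunks differ only in their last bytes.

```python
def prefix_key(chunk: bytes, size: int = 16) -> bytes:
    """Weakly order-preserving key: the first `size` bytes of the chunk."""
    return chunk[:size]


def sort_chunks(chunks: list[bytes]) -> list[int]:
    """Return chunk indices in sorted order.

    Tuples compare element by element, so the full chunk is only
    examined when two prefixes are equal.
    """
    return sorted(range(len(chunks)), key=lambda i: (prefix_key(chunks[i]), chunks[i]))


chunks = [b"a" * 20 + b"z", b"a" * 20 + b"b", b"abc", b"", b"b"]
order = sort_chunks(chunks)
```

Here the first two chunks share the same 16-byte prefix, so they collide on the key and are ordered by the tie-break, which is exactly the pigeonhole effect described above.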

SAY GOODBYE
Reply #5 · 2019-06-18 14:42

In theory there is no such thing. If you want, you can create a composed hash:

index:md5

I think this will resolve your needs.
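A minimal sketch of one possible reading of `index:md5`, assuming that "index" means the chunk's rank after a one-time full sort (the answer does not say so explicitly). The function name `composed_keys` and the 8-byte big-endian rank are illustrative choices, not part of the answer.

```python
import hashlib


def composed_keys(chunks: list[bytes]) -> list[bytes]:
    """Return one 24-byte key per chunk: 8-byte rank followed by the MD5 digest."""
    # One-time full sort; this is the expensive step that the keys then cache.
    order = sorted(range(len(chunks)), key=lambda i: chunks[i])
    keys = [b""] * len(chunks)
    for rank, i in enumerate(order):
        # Big-endian encoding makes byte-wise comparison agree with numeric rank.
        keys[i] = rank.to_bytes(8, "big") + hashlib.md5(chunks[i]).digest()
    return keys


chunks = [b"pear", b"apple", b"fig"]
keys = composed_keys(chunks)
by_key = sorted(range(len(chunks)), key=lambda i: keys[i])
```

The rank part carries the ordering and the MD5 part identifies the content, so the keys can be saved and compared later without touching the chunks. The full sort still has to be done once up front.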

在下西门庆
Reply #6 · 2019-06-18 14:49

Sorting an array of N strings, each of length K, can be done in just O(NK) or O(N^2 + NK) character comparisons.

For example, construct a trie.
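A minimal sketch of the trie approach, assuming byte strings and a plain dict-based trie (the function name `trie_sort` is hypothetical). Each string is walked once on insertion, and a pre-order traversal then yields the indices in lexicographic order. A naive trie like this can use a lot of memory for 1 MB chunks, so treat it as an illustration of the idea rather than a tuned implementation.

```python
def trie_sort(strings: list[bytes]) -> list[int]:
    """Return the indices of `strings` in lexicographic order using a byte trie."""
    root = {"children": {}, "ends": []}
    for idx, s in enumerate(strings):
        node = root
        for byte in s:  # iterating over bytes yields ints in 0..255
            node = node["children"].setdefault(byte, {"children": {}, "ends": []})
        node["ends"].append(idx)  # duplicates end at the same node

    order = []
    stack = [root]  # iterative pre-order walk avoids deep recursion on long strings
    while stack:
        node = stack.pop()
        order.extend(node["ends"])  # a string sorts before all of its extensions
        # Push children in descending order so the smallest byte is popped first.
        for byte in sorted(node["children"], reverse=True):
            stack.append(node["children"][byte])
    return order


strings = [b"banana", b"apple", b"app", b"", b"band", b"apple"]
order = trie_sort(strings)
```

Each node has at most 256 children, so the per-node sort is bounded by a constant and the total work stays proportional to the number of characters inserted.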

Or do a kind of insertion sort. Build the sorted set of strings S by adding strings to it one by one. For each new string P, scan it character by character while maintaining the index of the greatest string Q in S such that Q <= P; this index never decreases during the scan. When P ends, insert it into S just after Q. Each of the O(N) insertions can be done in O(N + K) operations: K character steps, plus at most O(N) index increments spread across them.


Once you have the indices of the strings in sorted order, just use them for your purposes instead of the "hashes" you want.
