How is __hash__ implemented in Python 3.2?

Posted 2019-05-04 17:14

Question:

I want to make a custom object hashable (via pickling). I found the __hash__ algorithm for Python 2.x (see the code below), but its results clearly differ from those of hash() in Python 3.2 (I wonder why?). Does anybody know how __hash__ is implemented in Python 3.2?

#Version: Python 3.2

def c_mul(a, b):
    #C type multiplication
    return eval(hex((int(a) * b) & 0xFFFFFFFF)[:-1])

class hs:
    #Python 2.x algorithm for hash from http://effbot.org/zone/python-hash.htm
    def __hash__(self):
        if not self:
            return 0 # empty
        value = ord(self[0]) << 7
        for char in self:
            value = c_mul(1000003, value) ^ ord(char)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value


def main():
    s = ["PROBLEM", "PROBLEN", "PROBLEO", "PROBLEP"]#, "PROBLEQ", "PROBLER", "PROBLES"]
    print("Python 3.2 hash() bild-in")
    for c in s[:]: print("hash('", c, "')=", hex(hash(c)),  end="\n")
    print("\n")
    print("Python 2.x type hash: __hash__()")
    for c in s[:]: print("hs.__hash__('", c, "')=", hex(hs.__hash__(c)),  end="\n")


if __name__ == "__main__":
    main()

OUTPUT:
Python 3.2 hash() bild-in
hash(' PROBLEM ')= 0x7a8e675a
hash(' PROBLEN ')= 0x7a8e6759
hash(' PROBLEO ')= 0x7a8e6758
hash(' PROBLEP ')= 0x7a8e6747


Python 2.x type hash: __hash__()
hs.__hash__(' PROBLEM ')= 0xa638a41
hs.__hash__(' PROBLEN ')= 0xa638a42
hs.__hash__(' PROBLEO ')= 0xa638a43
hs.__hash__(' PROBLEP ')= 0xa638a5c

Edit: The difference is explained for Python 3.2: "Hash values are now values of a new type, Py_hash_t, etc."

Edit 2: @Pih, thanks for the link: http://svn.python.org/view/python/trunk/Objects/stringobject.c?view=markup

static long
1263    string_hash(PyStringObject *a)
1264    {
1265        register Py_ssize_t len;
1266        register unsigned char *p;
1267        register long x;
1268    
1269        if (a->ob_shash != -1)
1270            return a->ob_shash;
1271        len = Py_SIZE(a);
1272        p = (unsigned char *) a->ob_sval;
1273        x = *p << 7;
1274        while (--len >= 0)
1275            x = (1000003*x) ^ *p++;
1276        x ^= Py_SIZE(a);
1277        if (x == -1)
1278            x = -2;
1279        a->ob_shash = x;
1280        return x;
1281    }

Answer 1:

The reason they differ is explained in the Python documentation:

Hash values are now values of a new type, Py_hash_t, which is defined to be the same size as a pointer. Previously they were of type long, which on some 64-bit operating systems is still only 32 bits long.

The hash computation also takes some new parameters into account; take a look at

 sys.hash_info 
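
To see what that reports on a given interpreter, here is a minimal sketch. It only reads the width and modulus fields, and the checks at the end rely on how CPython hashes integers, so treat them as an assumption about CPython rather than a language guarantee:

```python
import sys

info = sys.hash_info
print("hash width in bits:", info.width)
print("modulus used for numeric hashes:", info.modulus)

# Non-negative ints are reduced modulo info.modulus, so they wrap around there.
assert hash(info.modulus) == 0
assert hash(info.modulus + 1) == 1

# -1 is reserved as an error marker at the C level, so it is remapped to -2.
assert hash(-1) == -2
```

On a 32-bit build the width is 32, on a 64-bit build it is 64, which is why the same algorithm can give different numbers on different machines.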

For strings, take a look at string_hash(PyStringObject *a), at line 1263 of http://svn.python.org/view/python/trunk/Objects/stringobject.c?view=markup
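
As a rough Python 3 sketch of that C function: py2_string_hash and its bits parameter are names made up here, and the code assumes the text fits in Latin-1 so that each character maps to one byte, as in a Python 2 str. The mask stands in for the wrap-around of a C long. Note that the c_mul helper in the question was written for Python 2, where hex() of a long ends in "L"; under Python 3 the [:-1] slice drops a real hex digit instead, so masking is used here.

```python
def py2_string_hash(s, bits=32):
    """Sketch of string_hash for a C long that is `bits` wide (hypothetical helper)."""
    data = s.encode("latin-1")  # assumption: one byte per character
    if not data:
        return 0
    mask = (1 << bits) - 1
    x = (data[0] << 7) & mask
    for byte in data:
        # Multiply and XOR, keeping only the low `bits` bits like C would.
        x = ((1000003 * x) ^ byte) & mask
    x ^= len(data)
    # Reinterpret the unsigned result as a signed C long.
    if x >= 1 << (bits - 1):
        x -= 1 << bits
    # -1 is reserved as an error marker, so it is replaced by -2.
    if x == -1:
        x = -2
    return x


for word in ["PROBLEM", "PROBLEN", "PROBLEO", "PROBLEP"]:
    print(word, hex(py2_string_hash(word, bits=32)))
```

With bits=32 this gives 0x7a8e675a for "PROBLEM", the same value the question reports for the Python 3.2 built-in. That suggests the algorithm itself matched on that build and the mismatch in the question came from c_mul.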



Answer 2:

I looked up the new function in the source (in unicodeobject.c) and rebuilt it in Python. Here it is:

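def my_hash(string):
    # Rebuild of the string hash, assuming a 64-bit Py_hash_t.
    if not string:
        return 0  # the empty string hashes to 0
    x = ord(string[0]) << 7
    for c in string:
        x = (1000003 * x) ^ ord(c)
    x ^= len(string)
    x %= 2 ** 64  # keep the low 64 bits, as C arithmetic would
    if x >= 2 ** 63:  # sign bit (bit 63) set: reinterpret as negative
        x -= 2 ** 64
    if x == -1:
        x = -2  # -1 is reserved as an error marker
    return x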

This is 64-bit only, though. It now includes a correction for results that should be negative: Python integers never overflow the way C integers do, so the sign has to be restored by hand. (Better not to think about this too much.)
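
The sign handling is the fiddly part, so here is a small sketch that isolates it and cross-checks it against ctypes.c_int64, which performs the same reinterpretation. The helper name to_signed64 is made up for this example:

```python
import ctypes

MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed64(x):
    """Reinterpret the low 64 bits of x as a signed two's-complement value."""
    x &= MASK64
    return x - (1 << 64) if x >= (1 << 63) else x


# Cross-check against ctypes on a few boundary values.
for value in (0, 5, 2**63 - 1, 2**63, 2**64 - 1, 3 * 2**64 + 7):
    assert to_signed64(value) == ctypes.c_int64(value & MASK64).value
```

The test is on bit 63 because that is the sign bit of a 64-bit integer; anything at or above 2**63 wraps around to a negative number.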