Performance of small sets in Python

2019-07-21 02:01发布

问题:

I am looking for the most efficient way to represent small sets of integers in a given range (say 0-10) in Python. In this case, efficiency means fast construction (from an unsorted list), fast query (a couple of queries on each set), and reasonably fast construction of a sorted version (perhaps once per ten sets or so). A priori the candidates are using Python's builtin set type (fast query), using a sorted array (perhaps faster to constrct?), or using a bit-array (fast everything if I was in C... but I doubt Python will be that efficient (?)). Any advice of which one to choose?

Thanks.

回答1:

I'd use a bitmapping and store the members of a "set" in an int...which might actually be faster than the built-in set type in this case -- although I haven't tested that. It would definitely require less storage.

Update

I don't have the time right now to do a full set-like implementation and benchmark it against Python's built-in class, but here's what I believe is a working example illustrating my suggestion. As I think you'd agree, the code looks fairly fast as well as memory efficient.

Given Python's almost transparent "unlimited" long integer capabilities, what is written will automatically work with integer values in a much larger range than you need, although doing so would likely slow things down a bit. ;)

class BitSet(object):
    def __init__(self, *bitlist):
        self._bitmap = 0
        for bitnum in bitlist:
            self._bitmap |= (1 << bitnum)

    def add(self, bitnum):
        self._bitmap |= (1 << bitnum)

    def remove(self, bitnum):
        if self._bitmap & (1 << bitnum):
            self._bitmap &= ~(1 << bitnum)
        else:
            raise KeyError

    def discard(self, bitnum):
       self._bitmap &= ~(1 << bitnum)

    def clear(self):
        self._bitmap = 0

    def __contains__(self, bitnum):
        return bool(self._bitmap & (1 << bitnum))

    def __int__(self):
        return self._bitmap

if __name__ == '__main__':

    bs = BitSet()

    print '28 in bs:', 28 in bs
    print 'bs.add(28)'
    bs.add(28)
    print '28 in bs:', 28 in bs

    print
    print '5 in bs:', 5 in bs
    print 'bs.add(5)'
    bs.add(5)
    print '5 in bs:', 5 in bs

    print
    print 'bs.remove(28)'
    bs.remove(28)
    print '28 in bs:', 28 in bs


回答2:

In this case you might just use a list of True/False values. The hash table used by set will be doing the same thing, but it will include overhead for hashing, bucket assignment, and collision detection.

myset = [False] * 11
for i in values:
    myset[i] = True
mysorted = [i for i in range(11) if myset[i]]

As always you need to time it yourself to know how it works in your circumstances.



回答3:

My advice is to stick with the built-in set(). It will be very difficult to write Python code that beats the built-in C code for performance. Speed of construction and speed of lookup will be fastest if you are relying on the built-in C code.

For a sorted list, your best bet is to use the built-in sort feature:

x = set(seq) # build set from some sequence
lst = sorted(x)  # get sorted list from set

In general, in Python, the less code you write, the faster it is. The more you can rely on the built-in C underpinnings of Python, the faster. Interpreted Python is 20x to 100x slower than C code in many cases, and it is extremely hard to be so clever that you come out ahead vs. just using the built-in features as intended.

If your sets are guaranteed to always be integers in the range of [0, 10], and you want to make sure the memory footprint is as small as possible, then bit-flags inside an integer would be the way to go.

pow2 = [2**i for i in range(32)]

x = 0  # set with no values
def add_to_int_set(x, n):
    return x | pow2[n]

def in_int_set(x, n):
    return x & pow2[n]

def list_from_int_set(x):
    return [i for i in range(32) if x & pow2[i]]

I'll bet this is actually slower than using the built-in set() functions, but you know that each set will just be an int object: 4 bytes, plus the overhead of a Python object.

If you literally needed billions of them, you could save space by using a NumPy array instead of a Python list; the NumPy array will just store bare integers. In fact, NumPy has a 16-bit integer type, so if your sets are really only in the range of [0, 10] you could get the storage size down to two bytes each using a NumPy array.

http://www.scipy.org/FAQ#head-16a621f03792969969e44df8a9eb360918ce9613