I am looking for the most efficient way to represent small sets of integers in a given range (say 0-10) in Python. In this case, efficiency means fast construction (from an unsorted list), fast query (a couple of queries on each set), and reasonably fast construction of a sorted version (perhaps once per ten sets or so). A priori the candidates are Python's built-in set type (fast query), a sorted array (perhaps faster to construct?), or a bit-array (fast everything if I were in C... but I doubt Python will be that efficient). Any advice on which one to choose?
Thanks.
I'd use a bitmapping and store the members of a "set" in an `int`, which might actually be faster than the built-in `set` type in this case -- although I haven't tested that. It would definitely require less storage.

Update:
I don't have the time right now to do a full set-like implementation and benchmark it against Python's built-in class, but here's what I believe is a working example illustrating my suggestion. As I think you'd agree, the code looks fairly fast as well as memory efficient.
Given Python's almost transparent "unlimited" long integer capabilities, what is written will automatically work with integer values in a much larger range than you need, although doing so would likely slow things down a bit. ;)
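Since the example code itself did not survive here, the following is a minimal sketch of the bitmapped-int idea (the helper names are illustrative, not from any library): bit `i` of the integer records whether `i` is a member.

```python
# Sketch of a bitmapped-int "set": bit i of the integer is 1
# exactly when i is a member.  Helper names are illustrative.

def bitset_from_list(items):
    """Build the bitmask from an unsorted list of small non-negative ints."""
    mask = 0
    for i in items:
        mask |= 1 << i          # switch on bit i; duplicates are harmless
    return mask

def bitset_contains(mask, i):
    """Membership test: is bit i switched on?"""
    return bool(mask & (1 << i))

def bitset_sorted(mask):
    """Return the members in ascending order by scanning the bits."""
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]

s = bitset_from_list([7, 2, 2, 5])
print(bin(s))                  # 0b10100100
print(bitset_contains(s, 5))   # True
print(bitset_sorted(s))        # [2, 5, 7]
```

Because Python ints are arbitrary-precision, the same code works unchanged for ranges far beyond 0-10, just somewhat more slowly.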
My advice is to stick with the built-in `set()`. It will be very difficult to write Python code that beats the built-in C code for performance. Speed of construction and speed of lookup will be fastest if you rely on the built-in C code.

For a sorted list, your best bet is to use the built-in sort feature:
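A quick illustration of that combination, with construction, lookup, and sorting all handled by built-ins:

```python
# Build the set from an unsorted list, query it, and produce a sorted
# list with the built-in sorted() -- all three steps run in C.
values = [7, 2, 2, 5]
s = set(values)          # fast C construction; duplicates collapse
print(3 in s)            # fast C membership test: False
print(sorted(s))         # [2, 5, 7]
```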
In general, in Python, the less code you write, the faster it is. The more you can rely on the built-in C underpinnings of Python, the faster. Interpreted Python is 20x to 100x slower than C code in many cases, and it is extremely hard to be so clever that you come out ahead vs. just using the built-in features as intended.
If your sets are guaranteed to always be integers in the range of [0, 10], and you want to make sure the memory footprint is as small as possible, then bit-flags inside an integer would be the way to go.
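As a rough sketch of what those bit-flag operations look like on a plain `int` (variable names are mine, not from the original answer):

```python
# Bit-flag set over 0-10 stored in one int: add, test, remove, count.
s = 0
s |= 1 << 4                  # add 4
s |= 1 << 9                  # add 9
print(bool(s & (1 << 4)))    # True: 4 is a member
s &= ~(1 << 4)               # remove 4 by clearing its bit
print(bool(s & (1 << 4)))    # False
print(bin(s).count('1'))     # 1 member remains
```

(On Python 3.10+ you can count members with `s.bit_count()` instead of `bin(s).count('1')`.)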
I'll bet this is actually slower than using the built-in `set()` functions, but you know that each set will just be an `int` object: 4 bytes, plus the overhead of a Python object.

If you literally needed billions of them, you could save space by using a NumPy `array` instead of a Python list; the NumPy `array` will just store bare integers. In fact, NumPy has a 16-bit integer type, so if your sets are really only in the range of [0, 10] you could get the storage size down to two bytes each using a NumPy `array`.

http://www.scipy.org/FAQ#head-16a621f03792969969e44df8a9eb360918ce9613
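A hedged sketch of that storage layout (assuming NumPy is available): each set over 0-10 needs at most 11 bits, so a 16-bit integer holds it, and a NumPy array stores the masks at two bytes each with no per-element object overhead.

```python
import numpy as np

def make_bitmask(members):
    """Pack an iterable of small ints (0-10) into one bitmask."""
    mask = 0
    for m in members:
        mask |= 1 << m
    return mask

# A million sets stored at two bytes each:
sets = np.zeros(1_000_000, dtype=np.int16)
sets[0] = make_bitmask([1, 3, 10])

print(sets.itemsize)             # 2 bytes per stored set
print(bool(sets[0] & (1 << 3)))  # membership test on the stored mask: True
```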
In this case you might just use a list of True/False values. The hash table used by `set` will be doing the same thing, but it will include overhead for hashing, bucket assignment, and collision detection.

As always, you need to time it yourself to know how it works in your circumstances.
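A minimal sketch of the True/False-list representation (helper names are mine): membership is a plain list index, with no hashing involved.

```python
# A set over 0-10 as an 11-element list of booleans.
flags = [False] * 11

def flag_sorted(flags):
    """Members in ascending order: the indices that are True."""
    return [i for i, present in enumerate(flags) if present]

for i in [7, 2, 5]:
    flags[i] = True            # add member i

print(flags[5])                # membership test by indexing: True
print(flag_sorted(flags))      # [2, 5, 7]
```

To compare the candidates on your own workload, the standard-library `timeit` module is the usual tool.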