I am looking for the most efficient way to represent small sets of integers in a given range (say 0-10) in Python. In this case, efficiency means fast construction (from an unsorted list), fast query (a couple of queries on each set), and reasonably fast construction of a sorted version (perhaps once per ten sets or so). A priori the candidates are Python's built-in set type (fast query), a sorted array (perhaps faster to construct?), or a bit-array (fast everything if I were in C... but I doubt Python will be that efficient). Any advice on which one to choose?
Thanks.
I'd use a bitmapping and store the members of a "set" in an `int`, which might actually be faster than the built-in `set` type in this case -- although I haven't tested that. It would definitely require less storage.

Update:
I don't have the time right now to do a full set-like implementation and benchmark it against Python's built-in class, but here's what I believe is a working example illustrating my suggestion. As I think you'd agree, the code looks fairly fast as well as memory efficient.
Given Python's almost transparent "unlimited" long integer capabilities, what is written will automatically work with integer values in a much larger range than you need, although doing so would likely slow things down a bit. ;)
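Since the example code itself did not survive here, the following is a minimal sketch of the bitmapped-int idea (the helper names are illustrative, not from any library): bit `i` of the integer records whether `i` is a member.

```python
# Sketch of a bitmapped-int "set": bit i of the integer is 1
# exactly when i is a member.  Helper names are illustrative.

def bitset_from_list(items):
    """Build the bitmask from an unsorted list of small non-negative ints."""
    mask = 0
    for i in items:
        mask |= 1 << i          # switch on bit i; duplicates are harmless
    return mask

def bitset_contains(mask, i):
    """Membership test: is bit i switched on?"""
    return bool(mask & (1 << i))

def bitset_sorted(mask):
    """Return the members in ascending order by scanning the bits."""
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]

s = bitset_from_list([7, 2, 2, 5])
print(bin(s))                  # 0b10100100
print(bitset_contains(s, 5))   # True
print(bitset_sorted(s))        # [2, 5, 7]
```

Because Python ints are arbitrary-precision, the same code works unchanged for ranges far beyond 0-10, just somewhat more slowly.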
My advice is to stick with the built-in `set()`. It will be very difficult to write Python code that beats the built-in C code for performance. Speed of construction and speed of lookup will be fastest if you rely on the built-in C code.

For a sorted list, your best bet is to use the built-in sort feature:
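A quick illustration of that combination, with construction, lookup, and sorting all handled by built-ins:

```python
# Build the set from an unsorted list, query it, and produce a sorted
# list with the built-in sorted() -- all three steps run in C.
values = [7, 2, 2, 5]
s = set(values)          # fast C construction; duplicates collapse
print(3 in s)            # fast C membership test: False
print(sorted(s))         # [2, 5, 7]
```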
In general, in Python, the less code you write, the faster it is. The more you can rely on the built-in C underpinnings of Python, the faster. Interpreted Python is 20x to 100x slower than C code in many cases, and it is extremely hard to be so clever that you come out ahead vs. just using the built-in features as intended.
If your sets are guaranteed to always be integers in the range of [0, 10], and you want to make sure the memory footprint is as small as possible, then bit-flags inside an integer would be the way to go.
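As a rough sketch of what those bit-flag operations look like on a plain `int` (variable names are mine, not from the original answer):

```python
# Bit-flag set over 0-10 stored in one int: add, test, remove, count.
s = 0
s |= 1 << 4                  # add 4
s |= 1 << 9                  # add 9
print(bool(s & (1 << 4)))    # True: 4 is a member
s &= ~(1 << 4)               # remove 4 by clearing its bit
print(bool(s & (1 << 4)))    # False
print(bin(s).count('1'))     # 1 member remains
```

(On Python 3.10+ you can count members with `s.bit_count()` instead of `bin(s).count('1')`.)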
I'll bet this is actually slower than using the built-in `set()` functions, but you know that each set will just be an `int` object: 4 bytes, plus the overhead of a Python object.

If you literally needed billions of them, you could save space by using a NumPy `array` instead of a Python list; the NumPy `array` will just store bare integers. In fact, NumPy has a 16-bit integer type, so if your sets are really only in the range of [0, 10] you could get the storage size down to two bytes each using a NumPy `array`.

http://www.scipy.org/FAQ#head-16a621f03792969969e44df8a9eb360918ce9613
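A hedged sketch of that storage layout (assuming NumPy is available): each set over 0-10 needs at most 11 bits, so a 16-bit integer holds it, and a NumPy array stores the masks at two bytes each with no per-element object overhead.

```python
import numpy as np

def make_bitmask(members):
    """Pack an iterable of small ints (0-10) into one bitmask."""
    mask = 0
    for m in members:
        mask |= 1 << m
    return mask

# A million sets stored at two bytes each:
sets = np.zeros(1_000_000, dtype=np.int16)
sets[0] = make_bitmask([1, 3, 10])

print(sets.itemsize)             # 2 bytes per stored set
print(bool(sets[0] & (1 << 3)))  # membership test on the stored mask: True
```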
In this case you might just use a list of True/False values. The hash table used by `set` will be doing the same thing, but it will include overhead for hashing, bucket assignment, and collision detection.

As always, you need to time it yourself to know how it works in your circumstances.
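A minimal sketch of the True/False-list representation (helper names are mine): membership is a plain list index, with no hashing involved.

```python
# A set over 0-10 as an 11-element list of booleans.
flags = [False] * 11

def flag_sorted(flags):
    """Members in ascending order: the indices that are True."""
    return [i for i, present in enumerate(flags) if present]

for i in [7, 2, 5]:
    flags[i] = True            # add member i

print(flags[5])                # membership test by indexing: True
print(flag_sorted(flags))      # [2, 5, 7]
```

To compare the candidates on your own workload, the standard-library `timeit` module is the usual tool.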