I recently asked a question on Programmers regarding reasons to use manual bit manipulation of primitive types over std::bitset.
From that discussion I have concluded that the main reason is its comparatively poorer performance, although I'm not aware of any measured basis for this opinion. So my next question is: what is the performance hit, if any, likely to be incurred by using std::bitset over bit-manipulation of a primitive?
The question is intentionally broad, because after looking online I haven't been able to find anything, so I'll take what I can get. Basically I'm after a resource that provides some profiling of std::bitset vs 'pre-bitset' alternatives to the same problems on some common machine architecture using GCC, Clang and/or VC++. There is a very comprehensive paper which attempts to answer this question for bit vectors:
http://www.cs.up.ac.za/cs/vpieterse/pub/PieterseEtAl_SAICSIT2010.pdf
Unfortunately, it either predates std::bitset or considered it out of scope, so it focuses on vector/dynamic array implementations instead.
I really just want to know whether std::bitset is better than the alternatives for the use cases it is intended to solve. I already know that it is easier and clearer than bit-fiddling on an integer, but is it as fast?
Rhetorical question: Why is std::bitset written in such an inefficient way? Answer: It is not.

Another rhetorical question: What is the difference between:
and
Answer: a 50-fold difference in performance: http://quick-bench.com/iRokweQ6JqF2Il-T-9JSmR0bdyw
You need to be very careful what you ask for: bitset supports a lot of things, but each has its own cost. With correct handling you will get exactly the same behavior as raw code:
Both generate the same assembly: https://godbolt.org/g/PUUUyd (64-bit GCC)
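As a minimal sketch of what "correct handling" could look like (the helper names are hypothetical, and this is not the code behind the godbolt link), the two functions below perform the same set-and-test operation through the unchecked operator[] of std::bitset and through a raw 64-bit integer. Whether a given compiler emits identical code for both has to be checked on that compiler.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical helpers: both set bit i and then test bit j.
// Precondition for both: i < 64 and j < 64 (operator[] and the raw shift
// are undefined outside that range).
inline bool set_and_test_bitset(std::bitset<64>& bits, unsigned i, unsigned j) {
    bits[i] = true;   // unchecked access, no range test
    return bits[j];   // bitset::reference converts to bool
}

inline bool set_and_test_raw(std::uint64_t& bits, unsigned i, unsigned j) {
    bits |= std::uint64_t{1} << i;   // the same operation written by hand
    return ((bits >> j) & 1u) != 0;
}
```

Both versions leave the same bit pattern behind, so the bitset can be converted with to_ullong() and compared directly against the raw value.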
Another thing is that bitset is more portable, but this has a cost too:
If i > 64 then the bitset will be zero, while in the unsigned case we have UB. With a check preventing the UB, both generate the same code.
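The difference can be sketched as follows, assuming a 64-bit bitset and hypothetical function names. The bitset shift is defined for any shift count, while the raw shift needs an explicit guard to avoid undefined behavior.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Well defined for every i: shifting by 64 or more yields an all-zero bitset.
inline std::bitset<64> shift_bitset(std::bitset<64> bits, std::size_t i) {
    return bits << i;
}

// A raw shift by i >= 64 is undefined behavior on a 64-bit integer,
// so the range check is written by hand.
inline std::uint64_t shift_raw_checked(std::uint64_t bits, std::size_t i) {
    return i < 64 ? bits << i : 0;
}
```

With the guard in place the two functions compute the same result for all inputs, which is the situation where the answer expects matching code generation.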
Another place is set and []. The first one is safe, meaning you will never get UB, but it will cost you a branch. [] has UB if you use a wrong value, but it is as fast as using var |= 1L << i;. Of course, this holds only if the std::bitset does not need more bits than the biggest int available on the system, because otherwise you need to split the value to get the correct element in the internal table. This means that for std::bitset<N> the size N is very important for performance. If it is bigger or smaller than the optimal one, you will pay the cost of it.

Overall I find that the best way is to use something like this:
This will remove the cost of trimming the excess bits: http://quick-bench.com/Di1tE0vyhFNQERvucAHLaOgucAY
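A hedged sketch of both points, with hypothetical names throughout. The first two helpers contrast the range-checked set() with the unchecked operator[]. The sizing constants show one possible reading of the advice on N, namely rounding the number of bits you need up to what the implementation allocates anyway; this is an assumption, not necessarily the code behind the quick-bench link.

```cpp
#include <bitset>
#include <climits>
#include <cstddef>
#include <stdexcept>

// Range-checked: set() throws std::out_of_range when i >= bits.size().
// This wrapper reports the failure as false instead of propagating it.
inline bool checked_set(std::bitset<64>& bits, std::size_t i) {
    try {
        bits.set(i);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}

// Unchecked: the caller must guarantee i < 64, otherwise behavior is undefined.
inline void unchecked_set(std::bitset<64>& bits, std::size_t i) {
    bits[i] = true;
}

// Hypothetical sizing helper: round the bits actually needed up to the
// storage the implementation uses anyway, so no partial word is left over.
constexpr std::size_t kMinBits = 11;
constexpr std::size_t kPaddedBits = sizeof(std::bitset<kMinBits>) * CHAR_BIT;
using PaddedBitset = std::bitset<kPaddedBits>;
```

On common implementations PaddedBitset occupies the same storage as std::bitset<kMinBits>, but every stored bit is a valid bit, so shifts and complements have no excess bits to mask off.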
Update
It's been ages since I posted this one, but:
If you are using bitset in a way that does actually make it clearer and cleaner than bit-fiddling, like checking for one bit at a time instead of using a bit mask, then inevitably you lose all those benefits that bitwise operations provide, like being able to check whether 64 bits are set at one time against a mask, or using FFS instructions to quickly determine which bit is set among 64 bits.

I'm not sure that bitset incurs a penalty in every possible usage (e.g. using its bitwise operator&), but if you use it like a fixed-size boolean array, which is pretty much the way I always see people using it, then you generally lose all the benefits described above. Unfortunately, we can't get that level of expressiveness of accessing one bit at a time with operator[] and have the optimizer figure out all the bitwise manipulations, FFS, FFZ and so forth for us, at least not since the last time I checked (otherwise bitset would be one of my favorite structures).

Now if you are going to use bitset<N> bits interchangeably with, say, uint64_t bits[N/64], accessing both the same way using bitwise operations, it might be on par (I haven't checked since this ancient post). But then you lose many of the benefits of using bitset
in the first place.

for_each method

In the past I got into some misunderstandings, I think, when I proposed a for_each method to iterate through things like vector<bool>, deque, and bitset. The point of such a method is to utilize the internal knowledge of the container to iterate through elements more efficiently while invoking a functor, just as some associative containers offer a find method of their own instead of using std::find to do a better than linear-time search.

For example, you can iterate through all set bits of a vector<bool> or bitset if you had internal knowledge of these containers by checking for 64 elements at a time using a 64-bit mask when 64 contiguous indices are occupied, and likewise use FFS instructions when that's not the case.

But an iterator design having to do this type of scalar logic in operator++ would inevitably have to do something considerably more expensive, just by the nature in which iterators are designed in these peculiar cases. bitset lacks iterators outright, and that often leads people who want to avoid dealing with bitwise logic to use operator[] to check each bit individually in a sequential loop that just wants to find out which bits are set. That too is not nearly as efficient as what a for_each method implementation could do.

Double/Nested Iterators
Another alternative to the for_each container-specific method proposed above would be to use double/nested iterators: that is, an outer iterator which points to a sub-range of a different type of iterator. Client code example:

While not conforming to the flat type of iterator design available now in standard containers, this can allow some very interesting optimizations. As an example, imagine a case like this:
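As a minimal sketch of the idea (not the author's implementation), suppose bits 0 through 5 and bits 7 through 12 of a 64-bit word are set. The hypothetical helper for_each_run below reports each run of consecutive set bits as a half-open range, which is the kind of sub-range an outer iterator would hand to an inner one. The scalar loops stand in for the FFS/FFZ instructions a tuned version would use.

```cpp
#include <cstdint>

// Hypothetical helper: calls fn(begin, end) for each maximal run of
// consecutive set bits in a 64-bit word, as half-open ranges [begin, end).
template <typename Fn>
void for_each_run(std::uint64_t word, Fn fn) {
    unsigned pos = 0;
    while (pos < 64) {
        // Skip clear bits (an FFS-style instruction would do this in one step).
        while (pos < 64 && ((word >> pos) & 1u) == 0) ++pos;
        if (pos == 64) break;
        const unsigned begin = pos;
        // Skip set bits (an FFZ-style instruction would do this in one step).
        while (pos < 64 && ((word >> pos) & 1u) != 0) ++pos;
        fn(begin, pos);
    }
}
```

For the word 0x1FBF (bits 0 to 5 and 7 to 12 set) the callback receives the ranges [0, 6) and [7, 13); for 0xFFFF it receives the single range [0, 16). Inside each range the client only increments an integer.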
In that case, the outer iterator can, with just a few bitwise instructions (FFZ/or/complement), deduce that the first range of bits to process would be bits [0, 6), at which point we can iterate through that sub-range very cheaply through the inner/nested iterator (it would just increment an integer, making ++inner_it equivalent to just ++int). Then when we increment the outer iterator, it can very quickly, again with a few bitwise instructions, determine that the next range would be [7, 13). After we iterate through that sub-range, we're done. Take this as another example:

In such a case, the first and last sub-range would be [0, 16), and the bitset could determine that with a single bitwise instruction, at which point we can iterate through all set bits and then we're done.

This type of nested iterator design would map particularly well to vector<bool>, deque, and bitset as well as other data structures people might create like unrolled lists.

I say that in a way that goes beyond just armchair speculation, since I have a set of data structures resembling deque which are actually on par with sequential iteration of vector (still noticeably slower for random access, especially if we're just storing a bunch of primitives and doing trivial processing). However, to achieve times comparable to vector for sequential iteration, I had to use these types of techniques (for_each method and double/nested iterators) to reduce the amount of processing and branching going on in each iteration. I could not rival the times otherwise using just the flat iterator design and/or operator[]. And I'm certainly not smarter than the standard library implementers, yet I came up with a deque-like container which can be sequentially iterated much faster, and that strongly suggests to me that it's an issue with the standard interface design of iterators, which comes with some overhead in these peculiar cases that the optimizer cannot optimize away.

Old Answer
I'm one of those who would give you a similar performance answer, but I'll try to give you something a bit more in-depth than "just because". It is something I came across through actual profiling and timing, not merely distrust and paranoia.

One of the biggest problems with bitset and vector<bool> is that their interface design is "too convenient" if you want to use them like an array of booleans. Optimizers are great at obliterating all that structure you establish to provide safety, reduce maintenance cost, make changes less intrusive, etc. They do an especially fine job with selecting instructions and allocating the minimal number of registers to make such code run as fast as the not-so-safe, not-so-easy-to-maintain/change alternatives.

The part that makes the bitset interface "too convenient" at the cost of efficiency is the random-access operator[] as well as the iterator design for vector<bool>. When you access one of these at index n, the code has to first figure out which byte the nth bit belongs to, and then the sub-index to the bit within that. That first phase typically involves a division/right shift against an lvalue along with a modulo/bitwise and, which is more costly than the actual bit operation you're trying to perform.

The iterator design for vector<bool> faces a similar awkward dilemma where it either has to branch into different code every 8+ times you iterate through it or pay the kind of indexing cost described above. If the former is done, it makes the logic asymmetrical across iterations, and iterator designs tend to take a performance hit in those rare cases. To exemplify, if vector had a for_each method of its own, you could iterate through, say, a range of 64 elements at once by just masking the bits against a 64-bit mask for vector<bool> if all the bits are set, without checking each bit individually. It could even use FFS to figure out the range all at once. An iterator design would tend to inevitably have to do it in a scalar fashion or store more state which has to be redundantly checked every iteration.

For random access, optimizers can't seem to optimize away this indexing overhead to figure out which byte and relative bit to access (perhaps a bit too runtime-dependent) when it's not needed, and you tend to see significant performance gains with more manual code that processes bits sequentially with advance knowledge of which byte/word/dword/qword it's working on. It's somewhat of an unfair comparison, but the difficulty with std::bitset is that there's no way to make a fair comparison in such cases where the code knows what byte it wants to access in advance, and more often than not, you tend to have this info in advance. It's an apples to oranges comparison in the random-access case, but you often only need oranges.

Perhaps that wouldn't be the case if the interface design involved a bitset where operator[] returned a proxy, requiring a two-index access pattern to use. For example, in such a case, you would set the seventh and eighth bits by writing bitset[0][6] = true; bitset[0][7] = true; with a template parameter to indicate the size of the proxy (64 bits, e.g.). A good optimizer may be able to take such a design and make it rival the manual, old school way of doing the bit manipulation by hand by translating that into: bitset |= 0xC0;
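The indexing cost described above can be sketched like this, assuming 64-bit words held in a std::vector and hypothetical function names. The first function does what a flat indexed write has to do on every call. The second shows the manual alternative when the word is already known: setting the bits at zero-based positions 6 and 7 is a single OR with the mask 0xC0.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// What an indexed write has to do: locate the word, then the bit inside it.
// Precondition: n / 64 < words.size().
inline void set_bit_indexed(std::vector<std::uint64_t>& words, std::size_t n) {
    const std::size_t word = n / 64;   // which 64-bit word
    const std::size_t bit  = n % 64;   // which bit inside that word
    words[word] |= std::uint64_t{1} << bit;
}

// When the word is known in advance, several bits are set with one mask.
inline void set_bits_masked(std::uint64_t& word, std::uint64_t mask) {
    word |= mask;
}
```

Calling set_bit_indexed twice (for n = 6 and n = 7) and calling set_bits_masked once with 0xC0 leave the same value in the word; the difference is only in how much index arithmetic runs per bit.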
Another design that might help is if bitsets provided a for_each_bit kind of method, passing a bit proxy to the functor you provide. That might actually be able to rival the manual method.

std::deque has a similar interface problem. Its performance shouldn't be that much slower than std::vector for sequential access. Yet unfortunately we access it sequentially using operator[], which is designed for random access, or through an iterator, and the internal representation of deques simply doesn't map very efficiently to an iterator-based design. If deque provided a for_each kind of method of its own, then it could potentially get a lot closer to std::vector's sequential access performance. These are some of the rare cases where that Sequence interface design comes with some efficiency overhead that optimizers often can't obliterate. Often good optimizers can make convenience come free of runtime cost in a production build, but unfortunately not in all cases.

Sorry!
Also sorry, in retrospect I wandered a bit with this post talking about vector<bool> and deque in addition to bitset. It's because we had a codebase where the use of these three, and particularly iterating through them or using them with random access, were often hotspots.

Apples to Oranges
As emphasized in the old answer, comparing straightforward usage of bitset to primitive types with low-level bitwise logic is comparing apples to oranges. It's not like bitset is implemented very inefficiently for what it does. If you genuinely need to access a bunch of bits with a random access pattern which, for some reason or other, needs to check and set just one bit at a time, then it might be ideally implemented for such a purpose. But my point is that almost all use cases I've encountered didn't require that, and when it's not required, the old school way involving bitwise operations tends to be significantly more efficient.

Did a short test profiling std::bitset vs bool arrays for sequential and random access - you can too:
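A test of this kind can be sketched as follows. This is an illustrative harness with assumed names and sizes (kBits, time_us, sum_at), not the exact code behind the numbers reported below. The same index list is used for both containers, so a sequential run passes indices in order and a random-access run passes them shuffled.

```cpp
#include <bitset>
#include <chrono>
#include <cstddef>
#include <vector>

constexpr std::size_t kBits = 1 << 16;   // hypothetical container size

// Runs fn() once and returns the elapsed microseconds. fn's return value is
// stored in result so the caller can use it and the work is not discarded.
template <typename Fn>
long long time_us(Fn fn, unsigned long long& result) {
    const auto start = std::chrono::steady_clock::now();
    result = fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Counts the true flags at the given indices. Container can be a bool array
// or a std::bitset; every index must be within the container's range.
template <typename Container>
unsigned long long sum_at(const Container& c, const std::vector<std::size_t>& indices) {
    unsigned long long sum = 0;
    for (std::size_t i : indices) sum += c[i] ? 1 : 0;
    return sum;
}
```

A driver would fill a bool array and a std::bitset<kBits> with the same pattern, call time_us around sum_at for each, and print both the timings and the sums.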
Please note: the outputting of the sum total is necessary so the compiler doesn't optimise out the for loop - which some do if the result of the loop isn't used.
Under GCC x64 with the following flags: -O2;-Wall;-march=native;-fomit-frame-pointer;-std=c++11; I get the following results:
Bool array: random access time = 4695, sequential access time = 390
Bitset: random access time = 5382, sequential access time = 749
In addition to what the other answers said about the performance of access, there may also be a significant space overhead: typical bitset<> implementations simply use the longest integer type to back their bits. Thus, the following code
produces the following output on my machine:
As you see, my compiler allocates a whopping 64 bits to store a single one; with the bitfield approach, I only need to round up to eight bits.
This factor of eight in space usage can become important if you have a lot of small bitsets.
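The comparison can be reproduced with a sketch like the one below; the struct and constant names are hypothetical. The size of std::bitset<1> is implementation-defined, so the 64-bit figure above is what one common 64-bit implementation gives, not a guarantee.

```cpp
#include <bitset>
#include <cstddef>

// A one-bit flag stored in a bit-field; on common compilers the struct
// rounds up to a single byte.
struct OneBitField {
    unsigned char flag : 1;
};

// Storage actually used for one bit by each approach, in bytes.
constexpr std::size_t kBitsetBytes   = sizeof(std::bitset<1>);   // implementation-defined
constexpr std::size_t kBitFieldBytes = sizeof(OneBitField);
```

Printing kBitsetBytes * 8 and kBitFieldBytes * 8 shows the two bit counts being compared.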
Not a great answer here, but rather a related anecdote:
A few years ago I was working on real-time software and we ran into scheduling problems. There was a module which was way over time-budget, and this was very surprising because the module was only responsible for some mapping and packing/unpacking of bits into/from 32-bit words.
It turned out that the module was using std::bitset. We replaced this with manual operations and the execution time decreased from 3 milliseconds to 25 microseconds. That was a significant performance issue and a significant improvement.
The point is, the performance issues caused by this class can be very real.
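For context, manual packing and unpacking of fields in a 32-bit word usually comes down to a mask and a shift per field. The sketch below uses a hypothetical field layout (offset and width passed as parameters) and is not the code from the project described above.

```cpp
#include <cstdint>

// Mask with the low `width` bits set. Precondition: 0 < width <= 32.
inline std::uint32_t field_mask(unsigned width) {
    return width >= 32 ? 0xFFFFFFFFu : ((std::uint32_t{1} << width) - 1u);
}

// Writes `value` into the field of `width` bits starting at bit `offset`.
// Precondition: offset + width <= 32. Bits of value beyond width are dropped.
inline std::uint32_t pack_field(std::uint32_t word, unsigned offset,
                                unsigned width, std::uint32_t value) {
    const std::uint32_t mask = field_mask(width) << offset;
    return (word & ~mask) | ((value << offset) & mask);
}

// Reads the field of `width` bits starting at bit `offset`.
inline std::uint32_t unpack_field(std::uint32_t word, unsigned offset, unsigned width) {
    return (word >> offset) & field_mask(width);
}
```

Each call touches the whole field at once, whereas a bit-at-a-time loop over a bitset would execute one indexed access per bit.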