Description :
I implemented the following class LabSetInt64 (see below code).
The goal here is to manipulate large sets of big integers (up to values of 10M) as fast as possible. My main requirements are focused on :
- !Crucial : Getting the size/cardinality of a set as fast as possible
- !Important : Being able to iterate very fast through a set
So, starting from the implementation below, it still remain two points I would like to discuss with you.
The "popcount()" lazy implementation
int LabSetInt64::popcount(uint64_t x) {
int count;
for (count=0; x; count++)
x &= x-1;
return count;
}
I know that the implementation I chose is one of the slowest in the market place, but I could not find out a better one by myself. So if you can point me to a better one (64 bits, of course)...
I heard about a very efficient algorithm to compute cardinality : The "MIT HAKMEM 169" algorithm >>
// MIT HAKMEM.
// MIT HAKMEM Count is funky. Consider a 3 bit number as being 4a+2b+c. If we
// shift it right 1 bit, we have 2a+b. Subtracting this from the original gives
// 2a+b+c. If we right-shift the original 3-bit number by two bits, we get a,
// and so with another subtraction we have a+b+c, which is the number of bits
// in the original number. How is this insight employed? The first assignment
// statement in the routine computes tmp. Consider the octal representation of
// tmp. Each digit in the octal representation is simply the number of 1¡äs in
// the corresponding three bit positions in n. The last return statement sums
// these octal digits to produce the final answer. The key idea is to add
// adjacent pairs of octal digits together and then compute the remainder
// modulus 63. This is accomplished by right-shifting tmp by three bits, adding
// it to tmp itself and ANDing with a suitable mask. This yields a number in
// which groups of six adjacent bits (starting from the LSB) contain the number
// of 1¡äs among those six positions in n. This number modulo 63 yields the
// final answer. For 64-bit numbers, we would have to add triples of octal
// digits and use modulus 1023. This is HACKMEM 169, as used in X11 sources.
// Source: MIT AI Lab memo, late 1970¡äs.
// http://gurmeet.net/2008/08/05/fast-bit-counting-routines/
int CheckMIT(unsigned int n)
{
/* works for 32-bit numbers only */
/* fix last line for 64-bit numbers */
register unsigned int tmp;
tmp = n - ((n >> 1) & 033333333333)
- ((n >> 2) & 011111111111);
// the remainder of 63 for i equals the sum of 7 octal numbers which
// consititutes the 32-bit integer.
// For example, 03456 % 63 == 034 + 056
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
My problem with the algorithm above is that I don't understand it enough to turn it into "uint64_t" (64 bits) version (despite the few instructions to do it).
So if one of you could help me with that task (or at least help me to understand), that would be awesome.
Here is LabSetInt64.h :
/*
* LabSetInt64.h
*
* Created on: Feb 20, 2013
* Author: golgauth
*/
#ifndef LABSETINT64_H_
#define LABSETINT64_H_
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <stdint.h>
#include <limits>
#include <algorithm>
#include <LabTimeUtils.h>
using namespace std;
namespace elps {
// Lets assume we need at most 1M distinct indices in our sets
#define DEFAULT_SIZE_IN_BITS 1000000
class LabSetInt64
{
public:
LabSetInt64();
LabSetInt64(int sizeinbits);
LabSetInt64(const LabSetInt64 &);
~LabSetInt64();
void Clear();
void Resize(int sizeinbits);
void Set(int i);
void UnSet(int i);
void Set(int i, bool b);
bool Find(int i);
int Cardinality();
int NextSetBit(int i);
void Print();
/**
* Should have been "=" operator, but this overload is not compatible
* with Cython, so we'll use "<<" instead.
* @param
* @return
*/
const LabSetInt64 & operator << ( const LabSetInt64 & );
LabSetInt64 operator ~ () const;
LabSetInt64 operator + ( const LabSetInt64 & );
LabSetInt64 operator * ( const LabSetInt64 & );
LabSetInt64 operator - ( const LabSetInt64 & );
LabSetInt64 operator ^ ( const LabSetInt64 & );
private:
uint64_t *data;
int data_len;
int bits_size;
int popcount(uint64_t x);
int nb_trailing_zeros(uint64_t v);
};
} /* namespace elps */
#endif /* LABSETINT64_H_ */
And here the LabSetInt64.cpp :
/*
* LabSetInt64.cpp
*
* Created on: Feb 20, 2013
* Author: golgauth
*/
#include "LabSetInt64.h"
namespace elps {
/** PUBLICS **/
LabSetInt64::LabSetInt64() : LabSetInt64(DEFAULT_SIZE_IN_BITS) {
}
LabSetInt64::LabSetInt64(int sizeinbits) {
bits_size = sizeinbits;
data_len = (bits_size + 63) / 64;
data = new uint64_t[data_len];
}
LabSetInt64::LabSetInt64(const LabSetInt64& src) {
bits_size = src.bits_size;
data_len = src.data_len;
data = new uint64_t[data_len];
for( int i = 0; i < data_len; i++ ) // copy into this set
data[i] = src.data[i];
}
LabSetInt64::~LabSetInt64() {
}
void LabSetInt64::Clear() {
std::fill_n(data, data_len, 0);
}
void LabSetInt64::Resize(int sizeinbits) {
bits_size = sizeinbits + 1;
int size = (bits_size + 63) / 64;
uint64_t *new_data = new uint64_t[size];
memcpy( new_data, data, data_len * sizeof(uint64_t) );
data_len = size;
delete [] data;
data = new_data;
}
void LabSetInt64::Set(int i) {
data[i / 64] |= (1l << (i % 64));
}
void LabSetInt64::UnSet(int i) {
data[i / 64] &= ~(1l << (i % 64));
}
void LabSetInt64::Set(int i, bool b) {
if(b) Set(i); else UnSet(i);
}
bool LabSetInt64::Find(int i) {
return ((data[i / 64]) & (1l << (i % 64))); // binary AND;
}
int LabSetInt64::Cardinality() {
int sum = 0;
for(int i=0; i<data_len; i++)
sum += popcount(data[i]);
//sum += __builtin_popcount(data[i]);
return sum;
}
int LabSetInt64::NextSetBit(int i) {
int x = i / 64;
long w = data[x];
w >>= (i % 64);
if (w != 0) {
return i + nb_trailing_zeros(w);
}
++x;
for (; x < data_len; ++x) {
if (data[x] != 0) {
return x * 64 + nb_trailing_zeros(data[x]);
}
}
return -1;
}
void LabSetInt64::Print() {
int cur_id = this->NextSetBit(0);
cout << "\nSet size : " << this->Cardinality() << endl;
cout << "{ ";
while (cur_id != -1)
{
cout << (cur_id) << " ";
cur_id = this->NextSetBit(cur_id+1);
}
cout << "}" << endl;;
}
/** PRIVATES **/
//This is better when most bits in x are 0
//It uses 3 arithmetic operations and one comparison/branch per "1" bit in x.
int LabSetInt64::popcount(uint64_t x) {
int count;
for (count=0; x; count++)
x &= x-1;
return count;
}
// output: c will count v's trailing zero bits,
// so if v is 1101000 (base 2), then c will be 3
int LabSetInt64::nb_trailing_zeros(uint64_t v) {
int c;
if (v)
{
v = (v ^ (v - 1)) >> 1; // Set v's trailing 0s to 1s and zero rest
for (c = 0; v; c++) {
v >>= 1;
}
}
else
c = 8 * sizeof(v);
return c;
}
/** ***************************************************************************
********************************* OPERATORS *******************************
*******************************************************************************/
/** PUBLICS **/
/*******************************************************************************
* TODO >> May be better to use "DEFAULT_SIZE_IN_BITS" instead of "data_len" *
* in all the operators !!! *
* *
* => For now, we assume that all the Sets are created with the default *
* constructor (aka : bit_size = DEFAULT_SIZE_IN_BITS) *
*******************************************************************************/
/**
* "operator = " assigns the right hand side to this set.
* returns: nothing
*/
const LabSetInt64 &LabSetInt64::operator << ( const LabSetInt64 &rhs )
{
if( &rhs != this ) // avoid self assignment
{
this->Resize(rhs.bits_size - 1);
for( int i = 0; i < data_len; i++ ) // copy into this set
data[i] = rhs.data[i];
}
return *this; // enable x << y << z;
}
/**
* "operator ~ " performs set complement operation (not).
* returns: pointer to set
*/
LabSetInt64 LabSetInt64::operator ~ () const
{
LabSetInt64 rv;
for( int i = 0; i < data_len; i++ )
rv.data[i] = ~data[i]; // bitwise complement
return rv;
}
/**
* "operator + " performs set union operation (or).
* returns: pointer to set
*/
LabSetInt64 LabSetInt64::operator + ( const LabSetInt64 &rhs )
{
LabSetInt64 rv;
for( int i = 0; i < data_len; i++ )
rv.data[i] = data[i] | rhs.data[i]; // bitwise OR
return rv;
}
/**
* "operator * " performs set intersection operation (and).
* returns: pointer to set
*/
LabSetInt64 LabSetInt64::operator * ( const LabSetInt64 &rhs )
{
LabSetInt64 rv;
for( int i = 0; i < data_len; i++ )
rv.data[i] = data[i] & rhs.data[i]; // bitwise AND
return rv;
}
/**
* "operator - " performs set difference operation.
* returns: pointer to set
*/
LabSetInt64 LabSetInt64::operator - ( const LabSetInt64 &rhs )
{
LabSetInt64 rv;
for( int i = 0; i < data_len; i++ )
rv.data[i] = data[i] & ( ~rhs.data[i] ); // bitwise a AND ~b
return rv;
}
/**
* "operator ^ " performs set symmetric difference operation (xor).
* returns: pointer to set
*/
LabSetInt64 LabSetInt64::operator ^ ( const LabSetInt64 &rhs )
{
LabSetInt64 rv;
for( int i = 0; i < data_len; i++ )
rv.data[i] = data[i] ^ rhs.data[i]; // bitwise XOR
return rv;
}
/** *****************************************************************************
*********************************************************************************/
} /* namespace elps */
For the rest of the implementation, I am quite happy with the performances, but once again, any good comments would be very very well appreciated.
Thanks a lot for your time.
EDIT : Well, after all I'm not completely convinced by the "sparse" solution since I need the union, intersection, difference features, which would be, I guess, a lot more costly than my "bitwise" version (need to iterate through a potentially huge amount of items), so I am still interested in getting some help on how to turn the above "MIT HAKMEM" algorithm to 64 bits. Many Thanks.
EDIT : This version contains some little imperfections. Please, better look at the source code given bellow.
So, after having studied some other libraries (BitMagic and a few C++ and java implementations), and a lot of bechmarkings :
(I rejected the "sparse" option because of the need of fast combination operators)...
I went to some interesting improvements.
The main improvement is that I now maintain the cardinality on the fly (meaning that counting fast is no longer a big issue).
Improvements :
Better iteration :
Two improvements : using gcc's builtin function for counting trailing zeros and avoiding calling it when the block starts with a 1 improved significantly the performances while iterating.
Better counting :
usage of builtin
sse4.2
popcount function :"_mm_popcnt_u64" requires additional compilation options "-m64 -msse4.2" (not "__builtin_popcount")
added functions like :
Faster than making the operation, and then counting.
"optim counting" option (
CNT_OPT
mode) : See the new class version bellowSlows down a little bit the combination operations (just a few, no real worries...)
Faster insert :
By inserting only if not found :
It is faster to first check if the bit is already set, than setting it in any cases. (the most zeros we try to insert, the most we save time). The performances remain quite unchanged in case of setting a non-set bit.
Added some faster operators :
A += B is now much faster than A = A + B (saves the instantiation of a new vector to be returned)
Added some convenience functions and operators : See the new class version bellow
[
Header
][
Implementation
]Note about portability :
One some architectures (like
Windows 7
+MinGW
), wherelong
is only 32 bits, replace in the above code, all occurrences of :1l
with1ll
__builtin_popcountl
with__builtin_popcountll
__builtin_ctzl
with__builtin_ctzll
This will ensure you're always using 64 bits numbers (
long long
).Well, I hope this was informative and it can be a good starting point for some others.
On his website, Russ Cox describes a very simple integer set data structure that has O(n) iteration (as opposed to O(U) where U is some maximum) and O(1) insert, delete and count. It uses no bit fiddling. It also tends to use more space, but can be extended to have O(1) amortized append (by simply using
std::vector
instead of plain arrays).(I would paste sample code, but I think Cox's description is lucid enough and I'd just introduce bugs.)