how does postgres handle the bit data type?

2019-07-17 15:24发布

i have a table with a column vector of type bit(2000). how does the db engine handle operations AND and OR over this values? does it simply divide into 32bit chunks (or 64, respectively) and then compares each chunk separately and in the end simply concats the results together? or does it handle simply as two strings?

my point is to predict, which use case would be faster. i got a key-value data (user-item).

userID | itemID
U1     | I1
U1     | Ix
Un     | Ij

for each user i want to calculate a list of n nearest neighbors (using jaccard index, for example).

select my_jaccard(select itemID from table where userID=U1,select itemID from table where userID=U2)

my solution - i parsed the input data into a table of user-vector, where the vector is of type bit(2000) with 1's on the position representing the particular item.

userID | vector
U1     | 00.......01
U1     | 0..1.....00
Un     | 00..1..1..0

upon this table i simply do

select vector1&vector2

the point is that each user has at most only 10 records for all items, i.e. the vector has maximum of 10 active bits. i think, that parsing the whole bitvector just to find the active bits needs more computational resources, than simply comparing those 10 values of user1 with 10 values of user2 each with another.

is it faster to use long bit-vectors which have very few bits set to 1 or is it better to use to original values as a sets and compare two sets together? (a set has maximum of 10 items)

i use both psql v8.2 and v9.x

2条回答
家丑人穷心不美
2楼-- · 2019-07-17 15:59

The source code seems to compare byte-by-byte. Search the PostgreSQL source code for the functions "bit_and" and "bit_or". (There doesn't seem to be a natural way for me to link directly to a function.)

Excerpt of bit_and(), lines 1205 to 1209 of varbit.c

p1 = VARBITS(arg1);
p2 = VARBITS(arg2);
r = VARBITS(result);
for (i = 0; i < VARBITBYTES(arg1); i++)
    *r++ = *p1++ & *p2++;
查看更多
爷的心禁止访问
3楼-- · 2019-07-17 16:08

Bit operations on the bit types are handled internally as, er, bit operations. Here is what the "and" code does, for example:

p1 = VARBITS(arg1);
p2 = VARBITS(arg2);
r = VARBITS(result);
for (i = 0; i < VARBITBYTES(arg1); i++)
    *r++ = *p1++ & *p2++;

(So it's actually 8-bit chunks.)

So I think this should be pretty fast.

查看更多
登录 后发表回答