Interpret columns of zeros and ones as binary and

I have a dataframe of zeros and ones. I want to treat each column as if its values were the binary representation of an integer. What is the easiest way to make this conversion?

I want this:

df = pd.DataFrame([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])

print(df)

   0  1  2
0  1  0  1
1  1  1  0
2  0  1  1
3  0  0  1

converted to:

0    12
1     6
2    11
dtype: int64

As efficiently as possible.

Tags: python numpy pandas binary integer

3 Answers

甜甜的少女心

#2 · 2020-04-10 04:17

This is similar in concept to @jezrael's dot-product solution, but with a couple of improvements. We can avoid the transpose by placing the powers-of-2 array on the left side of the dot product. This helps for large arrays, since transposing them adds some overhead. NumPy arrays are also better suited to this kind of number crunching, so we can operate on df.values instead of on the dataframe. At the end, we convert the result back to a pandas Series for the final output.

Thus, combining these two improvements, the modified implementation would be -

pd.Series((2**np.arange(df.shape[0]-1,-1,-1)).dot(df.values))
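As a minimal sketch of why this works, the one-liner can be unpacked on the example frame from the question. The names `weights` and `result` are illustrative only, and the sketch assumes the first row holds the most significant bit, as in the question's expected output.

```python
import numpy as np
import pandas as pd

# The example frame from the question: each column holds one binary number,
# with the most significant bit in the first row.
df = pd.DataFrame([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])

# Powers of two from most to least significant bit: [8, 4, 2, 1] for 4 rows.
weights = 2 ** np.arange(df.shape[0] - 1, -1, -1)

# (4,) dot (4, 3) -> (3,): one weighted sum per column, no transpose needed.
result = pd.Series(weights.dot(df.values))

print(result.tolist())  # [12, 6, 11]
```

Note that the weights are fixed-width integers, so this sketch is only safe while the frame has at most 63 rows (assuming a 64-bit integer dtype); beyond that the largest power of two no longer fits.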

Runtime test -

In [159]: df = pd.DataFrame(np.random.randint(0,2,(4,10000)))

In [160]: p1 = pd.Series((2**np.arange(df.shape[0]-1,-1,-1)).dot(df.values))

# @jezrael's solution
In [161]: p2 = (df.T.dot(1 << np.arange(df.shape[0] - 1, -1, -1)))

In [162]: np.allclose(p1.values, p2.values)
Out[162]: True

In [163]: %timeit pd.Series((2**np.arange(df.shape[0]-1,-1,-1)).dot(df.values))
1000 loops, best of 3: 268 µs per loop

# @jezrael's solution
In [164]: %timeit (df.T.dot(1 << np.arange(df.shape[0] - 1, -1, -1)))
1000 loops, best of 3: 554 µs per loop


你好瞎i

#3 · 2020-04-10 04:32

You can create a string from the column values and then use int(binary_string, base=2) to convert to integer:

df.apply(lambda col: int(''.join(str(v) for v in col), 2))
Out[6]: 
0    12
1     6
2    11
dtype: int64

I'm not sure about efficiency: multiplying by the relevant powers of 2 and then summing probably takes better advantage of fast NumPy operations. The string approach is probably more convenient, though.
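To make the comparison concrete, here is a rough sketch that runs both routes on the question's frame and checks that they agree. The variable names (`via_string`, `via_powers`, `powers`) are made up for illustration, and the multiply-then-sum line is just one possible way to write that idea.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])

# String route: join each column's digits and parse them as a base-2 number.
via_string = df.apply(lambda col: int(''.join(str(v) for v in col), 2))

# Arithmetic route: scale each row by its power of two, then sum per column.
powers = 2 ** np.arange(len(df) - 1, -1, -1)  # [8, 4, 2, 1]
via_powers = df.mul(powers, axis=0).sum()

print(via_string.tolist(), via_powers.tolist())
```

Both should give the same values; which one is faster depends on the shape of the frame, so it is worth timing on your own data.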


啃猪蹄的小仙女

#4 · 2020-04-10 04:40

Similar solution, but faster:

print (df.T.dot(1 << np.arange(df.shape[0] - 1, -1, -1)))
0    12
1     6
2    11
dtype: int64

Timings:

In [81]: %timeit df.apply(lambda col: int(''.join(str(v) for v in col), 2))
The slowest run took 5.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 264 µs per loop

In [82]: %timeit (df.T*(1 << np.arange(df.shape[0]-1, -1, -1))).sum(axis=1)
1000 loops, best of 3: 492 µs per loop

In [83]: %timeit (df.T.dot(1 << np.arange(df.shape[0] - 1, -1, -1)))
The slowest run took 6.14 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 204 µs per loop

