Create hash value for each row of data with select

Posted 2019-01-26 07:39

I have asked a similar question in R about creating a hash value for each row of data. I know that I can use something like hashlib.md5(b'Hello World').hexdigest() to hash a string, but how do I hash a row in a dataframe?

update 01

I have drafted my code as below:

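for index, row in course_staff_df.iterrows():
    # hashlib.md5 needs bytes in Python 3, so encode the string form of the selected columns
    temp_df.loc[index, 'hash'] = hashlib.md5(str(row[['cola', 'colb']].values).encode('utf-8')).hexdigest()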

This doesn't seem very pythonic to me. Is there a better solution?

2 answers
看我几分像从前
#2 · 2019-01-26 07:50

You can sum the hashes of all of the elements in the row:

>>> sum(hash(i) for i in df.iloc[0])
1985985746

A different method would be to coerce the row (a Series object) to a tuple:

>>> hash(tuple(df.iloc[1]))
-4901655572611365671

Doing so for every row and appending the result as a column would look like this:

>>> df['hash'] = pd.Series((sum(hash(e) for e in row) for i, row in df.iterrows()))
>>> df
           y  x0        hash
0  11.624345  10  1985985746
1  10.388244  11  1545726335
2  11.471828  12  2436256751
3  11.927031  13  2285800314
4  14.865408  14  3717237475
5  12.698461  15  2135377648
6  17.744812  16  2029383679
7  16.238793  17  2404124222
8  18.319039  18  1890913345
9  18.750630  19  2088408352

[10 rows x 3 columns]

If you'd rather hash the tuple of the row:

>>> df = df.drop(columns='hash') # lose the old hash
>>> df['hash'] = pd.Series((hash(tuple(row)) for _, row in df.iterrows()))
>>> df
           y  x0                 hash
0  11.624345  10 -7519341396217622291
1  10.388244  11 -6224388738743104050
2  11.471828  12 -4278475798199948732
3  11.927031  13 -1086800262788974363
4  14.865408  14  4065918964297112768
5  12.698461  15  8870116070367064431
6  17.744812  16 -2001582243795030948
7  16.238793  17  4683560048732242225
8  18.319039  18 -4288960467160144170
9  18.750630  19  7149535252257157079

[10 rows x 3 columns]
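
One difference between the two approaches is worth seeing in isolation: summing the per-element hashes ignores the order of the values, while hashing the tuple does not. The sketch below uses two made-up rows as plain tuples (`row_a` and `row_b` are illustrative names, not part of the dataframe above).

```python
# Two hypothetical rows holding the same values in a different order.
row_a = (1, 2, 3)
row_b = (3, 2, 1)

# Summing per-element hashes is order-insensitive, so these rows collide.
sum_a = sum(hash(v) for v in row_a)
sum_b = sum(hash(v) for v in row_b)
print(sum_a == sum_b)  # True

# Hashing the whole tuple takes each value's position into account.
tuple_a = hash(row_a)
tuple_b = hash(row_b)
print(tuple_a == tuple_b)
```

If column order matters for your data, the tuple version is probably the safer choice.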
#3 · 2019-01-26 08:04

Or simply:

df.apply(lambda x: hash(tuple(x)), axis = 1)

As an example:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,5))
print(df)
print(df.apply(lambda x: hash(tuple(x)), axis = 1))

     0         1         2         3         4
0  0.728046  0.542013  0.672425  0.374253  0.718211
1  0.875581  0.512513  0.826147  0.748880  0.835621
2  0.451142  0.178005  0.002384  0.060760  0.098650

0    5024405147753823273
1    -798936807792898628
2   -8745618293760919309
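
Since the question asks about hashlib and selected columns, here is a minimal sketch of the same apply pattern with md5, restricted to 'cola' and 'colb'. The sample data, the `row_md5` helper, the "|" separator, and the 'other' and 'hash64' column names are all assumptions for illustration. Note that the built-in hash() of strings is salted per Python process by default, so a hashlib digest is the better fit if the values need to be reproducible between runs. `pd.util.hash_pandas_object` is shown as a possible vectorised alternative that returns a 64-bit integer per row.

```python
import hashlib

import pandas as pd

# Hypothetical data; 'cola' and 'colb' are the column names from the question.
df = pd.DataFrame({
    "cola": ["x", "y", "x"],
    "colb": [1, 2, 1],
    "other": [10.0, 20.0, 30.0],
})


def row_md5(row):
    """Return the md5 hex digest of one row's values (illustrative helper)."""
    # Join with a separator so that ("ab", "c") and ("a", "bc") hash differently.
    joined = "|".join(str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hash only the selected columns, one digest per row.
df["hash"] = df[["cola", "colb"]].apply(row_md5, axis=1)

# Vectorised alternative: one uint64 hash per row, ignoring the index.
df["hash64"] = pd.util.hash_pandas_object(df[["cola", "colb"]], index=False)

print(df)
```

Rows 0 and 2 share the same 'cola' and 'colb' values, so they get the same hash in both columns even though 'other' differs.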