I asked a similar question about R, on creating a hash value for each row of data. I know that I can use something like hashlib.md5(b'Hello World').hexdigest()
to hash a string, but how do I hash a row in a DataFrame?
update 01
I have drafted my code as below:
for index, row in course_staff_df.iterrows():
    # md5 needs bytes, so encode the string form of the selected values
    temp_df.loc[index, 'hash'] = hashlib.md5(str(row[['cola', 'colb']].values).encode('utf-8')).hexdigest()
This doesn't seem very Pythonic to me. Is there a better solution?
Or simply:
df.apply(lambda x: hash(tuple(x)), axis = 1)
As an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,5))
print(df)
df.apply(lambda x: hash(tuple(x)), axis = 1)
          0         1         2         3         4
0  0.728046  0.542013  0.672425  0.374253  0.718211
1  0.875581  0.512513  0.826147  0.748880  0.835621
2  0.451142  0.178005  0.002384  0.060760  0.098650
0     5024405147753823273
1     -798936807792898628
2    -8745618293760919309
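As a possible vectorized alternative to the apply call above, pandas ships a row-hashing helper, pd.util.hash_pandas_object. The sketch below is only an illustration: the column names cola and colb are borrowed from the question and the sample values are made up.

```python
import pandas as pd

# Made-up sample data; rows 0 and 2 hold the same values on purpose
df = pd.DataFrame({"cola": ["a", "b", "a"], "colb": [1, 2, 1]})

# index=False leaves the row label out of the hash, so rows with
# identical values end up with identical hashes
row_hashes = pd.util.hash_pandas_object(df, index=False)

df["hash"] = row_hashes
print(df)
```

The result is a Series of unsigned 64-bit integers, one per row, computed without a Python-level loop.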
You can sum the hashes of all of the elements in the row:
>>> sum(hash(i) for i in df.iloc[0])
1985985746
A different method would be to coerce the row (a Series object) to a tuple:
>>> hash(tuple(df.iloc[1]))
-4901655572611365671
To do so for every row, appended as a column would look like this:
>>> df['hash'] = pd.Series((sum(hash(e) for e in row) for i, row in df.iterrows()))
>>> df
           y  x0        hash
0  11.624345  10  1985985746
1  10.388244  11  1545726335
2  11.471828  12  2436256751
3  11.927031  13  2285800314
4  14.865408  14  3717237475
5  12.698461  15  2135377648
6  17.744812  16  2029383679
7  16.238793  17  2404124222
8  18.319039  18  1890913345
9  18.750630  19  2088408352
[10 rows x 3 columns]
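One thing worth checking before choosing between the two approaches: addition is commutative, so the summed hash cannot tell apart rows that hold the same values in a different column order, whereas a tuple hash depends on position. A small sketch with made-up data (the column names a and b are illustrative only):

```python
import pandas as pd

# Two rows containing the same values in swapped columns
df = pd.DataFrame({"a": [1.5, 2.5], "b": [2.5, 1.5]})

sum_hashes = [sum(hash(e) for e in row) for _, row in df.iterrows()]
tuple_hashes = [hash(tuple(row)) for _, row in df.iterrows()]

# The sums are equal because a + b == b + a
print(sum_hashes[0] == sum_hashes[1])
# The tuple hashes take element order into account
print(tuple_hashes[0] == tuple_hashes[1])
```

If swapped values should count as different rows, the tuple form is the safer choice.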
If you'd rather hash the tuple of the row:
>>> df = df.drop(columns='hash') # lose the old hash
>>> df['hash'] = pd.Series((hash(tuple(row)) for _, row in df.iterrows()))
>>> df
           y  x0                  hash
0  11.624345  10  -7519341396217622291
1  10.388244  11  -6224388738743104050
2  11.471828  12  -4278475798199948732
3  11.927031  13  -1086800262788974363
4  14.865408  14   4065918964297112768
5  12.698461  15   8870116070367064431
6  17.744812  16  -2001582243795030948
7  16.238793  17   4683560048732242225
8  18.319039  18  -4288960467160144170
9  18.750630  19   7149535252257157079
[10 rows x 3 columns]
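Note that Python's built-in hash() is salted per process for strings, so the values above may not be reproducible across runs when a row contains text. If a stable digest is needed, as with the hashlib.md5 call in the question, something along these lines might work. This is a rough sketch: the helper md5_row, the sample data, and the choice of separator are all assumptions, not part of the original answers.

```python
import hashlib

import pandas as pd

# Made-up sample data using the column names from the question
df = pd.DataFrame({
    "cola": ["x", "y", "x"],
    "colb": [1, 2, 1],
    "other": [0.1, 0.2, 0.3],
})

def md5_row(row):
    """Return the MD5 hex digest of a row's values (hypothetical helper)."""
    # Join with the ASCII unit separator, assumed not to occur in the data,
    # so that ("ab", "c") and ("a", "bc") do not produce the same string
    joined = "\x1f".join(str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hash only the selected columns, row by row
df["hash"] = df[["cola", "colb"]].apply(md5_row, axis=1)
print(df)
```

Rows 0 and 2 share the same cola and colb values, so they get the same digest even though the other column differs.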