I have a pandas DataFrame
like this:
df = pd.DataFrame({'custid':[1,2,3,4],
...: 'prod1':['jeans','tshirt','jacket','tshirt'],
...: 'prod1_hnode1':[1,2,3,2],
...: 'prod1_hnode2':[6,7,8,7],
...: 'prod2':['tshirt','jeans','jacket','shirt'],
...: 'prod2_hnode1':[2,1,3,4],
...: 'prod2_hnode2':[7,6,8,7]})
In [54]: df
Out[54]:
custid prod1 prod1_hnode1 prod1_hnode2 prod2 prod2_hnode1 \
0 1 jeans 1 6 tshirt 2
1 2 tshirt 2 7 jeans 1
2 3 jacket 3 8 jacket 3
3 4 tshirt 2 7 shirt 4
prod2_hnode2
0 7
1 6
2 8
3 7
How can I convert this to the following format:
dfnew = pd.DataFrame({'custid':[1,1,2,2,3,3,4,4],
...: 'prod':['prod1','prod2','prod1','prod2','prod1','prod2','prod1','prod2'],
...: 'rec':['jeans','tshirt','tshirt','jeans','jacket','jacket','tshirt','shirt'],
...: 'hnode1':[1,2,2,1,3,3,2,4],
...: 'hnode2':[6,7,7,6,8,8,7,7]})
In [56]: dfnew
Out[56]:
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
Use:
set_index
by columncustid
MultiIndex
in columns bysplit
NaN
s in columns byrec
stack
by first levelreset_index
for columns fromMultiIndex
Here's another approach that should work, but uses repeated
melt
s.You mention R in your comments.
melt
from "data.table" should be able to handle this much more easily since you can melt multiple sets of columns at once, similar to how you might approach the problem with base R'sreshape
function.The base R approach might be something along the lines of: