I'm looking for a way to use pandas and Python to combine several columns of an Excel sheet, whose names are known, into a single new column, keeping all the important information, as in the example below:
input:
ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
desired output:
ID,tp_all
0,transportation
1,cars
2,boats
3,cars+boats
4,boats+planes
5,cars+boats+planes
The row with an ID of 0 contains a description of the contents of each column. Ideally the code would parse the description in the second row of the sheet, take the text after the '-', and concatenate those values in the new "tp_all" column.
OK, a more dynamic method:
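The original code for this answer is not shown here, so what follows is only a minimal sketch of one dynamic approach, not necessarily the one the answer had in mind. It assumes the sheet has already been loaded into a DataFrame (here the sample from the question is read from a string; with a real file you would use something like pd.read_excel), that the descriptions in row 0 always have the form "category - label", and that a ticked cell contains exactly the string "checked". The names tp_cols, labels, join_labels and result are made up for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question; a real sheet would be loaded
# with pd.read_excel instead.
csv_text = """ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
"""
df = pd.read_csv(io.StringIO(csv_text))

tp_cols = ["tp_c", "tp_b", "tp_p"]  # the known column names

# Row 0 holds "<category> - <label>"; split each description once on " - ".
parts = df.loc[0, tp_cols].str.split(" - ", n=1, expand=True)
category = parts[0].iloc[0]   # text before the dash, e.g. "transportation"
labels = parts[1].to_dict()   # maps column name to the text after the dash

# Everything below row 0 is real data; turn "checked" into True/False.
data = df.iloc[1:]
checked = data[tp_cols].eq("checked")

def join_labels(row):
    """Join the labels of all checked columns in this row with '+'."""
    return "+".join(labels[col] for col in tp_cols if row[col])

tp_all = checked.apply(join_labels, axis=1)

# Rebuild the layout from the question: description row first, then the data.
header = pd.DataFrame({"ID": [0], "tp_all": [category]})
body = pd.DataFrame({"ID": data["ID"], "tp_all": tp_all})
result = pd.concat([header, body], ignore_index=True)
print(result.to_csv(index=False))
```

Because the labels are read from row 0 rather than typed in, adding another tp_ column only means extending tp_cols.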
Here is one way:
Then:
You can create a new DataFrame with this column or do what you like with it.
Edit: I see that you have edited your question so that the names of the modes of transportation are now in row 0 instead of in the column headers. It is easier if they're in the column headers (as my answer assumes), and your new column headers don't seem to contain any additional useful information, so you should probably start by just setting the column names to the info from row 0, and deleting row 0.
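A rough sketch of that suggestion, under a few assumptions: the data is already in a DataFrame shaped like the edited question, the useful name is whatever follows " - " in row 0, and a ticked cell is exactly the string "checked". The joining step at the end is just one possible way to build the combined column.

```python
import pandas as pd

# Data in the shape of the edited question (descriptions in row 0).
df = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5],
    "tp_c": ["transportation - cars", "checked", "-", "checked", "-", "checked"],
    "tp_b": ["transportation - boats", "-", "checked", "checked", "checked", "checked"],
    "tp_p": ["transportation - planes", "-", "-", "-", "checked", "checked"],
}).set_index("ID")

# Set the column names to the info from row 0 (the part after " - ") ...
df.columns = df.loc[0].str.split(" - ").str[-1].tolist()
# ... and delete row 0.
df = df.drop(index=0)

# With the modes in the headers, join the names of the checked columns.
is_checked = df == "checked"
tp_all = is_checked.apply(lambda row: "+".join(row[row].index), axis=1)
print(tp_all)
```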
This is quite interesting as it's a reverse get_dummies... I think I would manually munge the column names so that you have a boolean DataFrame:
Now you can use an apply with zip:
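The snippets for these two steps are missing here, so this is only a sketch of what they could look like. It assumes the munging has already produced a boolean DataFrame whose columns are the mode names and whose index is the ID; the values are typed in by hand from the question's sample, and the names cols, bools and tp_all are illustrative.

```python
import pandas as pd

cols = ["cars", "boats", "planes"]

# Boolean frame after manually munging the column names (True = "checked").
bools = pd.DataFrame(
    [
        [True, False, False],
        [False, True, False],
        [True, True, False],
        [False, True, True],
        [True, True, True],
    ],
    columns=cols,
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

# apply with zip: pair each column name with the row's flag and keep the
# names whose flag is True, joined with '+'.
tp_all = bools.apply(
    lambda flags: "+".join(name for name, flag in zip(cols, flags) if flag),
    axis=1,
)
print(tp_all)
```

This runs a Python-level loop per row, which is fine for a small sheet but is the "manual" part the answer wishes it could avoid.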
Now you just have to tweak the headers to get the desired CSV.
It would be nice if there were a less manual / faster way to do a reverse get_dummies...