I'm trying to find a generic way of creating (possibly deeply) nested dictionaries from a flat Pandas DataFrame instance.
Suppose I have the following DataFrame:
dat = pd.DataFrame({'name' : ['John', 'John', 'John', 'John', 'Henry', 'Henry'],
'age' : [24, 24, 24, 24, 31, 31],
'gender' : ['Male','Male','Male','Male','Male','Male'],
'study' : ['Mathematics', 'Mathematics', 'Mathematics', 'Philosophy', 'Physics', 'Physics'],
'course' : ['Calculus 101', 'Calculus 101', 'Calculus 102', 'Aristotelean Ethics', 'Quantum mechanics', 'Quantum mechanics'],
'test' : ['Exam', 'Essay','Exam','Essay', 'Exam1','Exam2'],
'pass' : [True, True, True, True, True, True],
'grade' : ['A', 'A', 'B', 'A', 'C', 'C']})
dat = dat[['name', 'age', 'gender', 'study', 'course', 'test', 'grade', 'pass']] #re-order columns to better reflect data structure
I want to create a deeply nested dictionary (or list of nested dictionaries), that 'respects' the underlying structure of this data. That is, a grade is information about a test, which is part of a course, which is part of a study, that a person does. Also, age and gender are information about that same person.
An example desired output is this:
[{'John': {'age': 24,
'gender': 'Male',
'study': {'Mathematics': {'Calculus 101': {'Exam': {'grade': 'B',
'pass': True}}},
'Philosophy': {'Aristotelean Ethics': {'Essay': {'grade': 'A',
'pass': True}}}}}},
{'Henry': {'age': 31,
'gender': 'Male',
'study': {'Physics': {'Quantum mechanics': {'Exam1': {'Grade': 'C',
'Pass': True},
'Exam2': {'Grade': 'C',
'Pass': True}}}}}}]
(although there may be other, similar ways to structure such data).
I tried using groupby, which makes it easy, for example, to nest 'grade' and 'pass' under 'test', nest 'test' under 'course', nest 'course' under 'study', and 'study' under 'name'. But, then I don't see how to add 'gender' and 'age' under 'name' as well? Something like this is the best I came up with:
dic = {}
for ind, row in dat.groupby(['name', 'study', 'course', 'test'])['grade', 'pass']:
#this is ugly and not very generic, but just as an example
if not ind[0] in dic:
dic[ind[0]] = {}
if not ind[1] in dic[ind[0]]:
dic[ind[0]][ind[1]] = {}
if not ind[2] in dic[ind[0]][ind[1]]:
dic[ind[0]][ind[1]][ind[2]] = {}
if not ind[3] in dic[ind[0]][ind[1]][ind[2]]:
dic[ind[0]][ind[1]][ind[2]][ind[3]] = {}
dic[ind[0]][ind[1]][ind[2]][ind[3]]['grade'] = row['grade'].values[0]
dic[ind[0]][ind[1]][ind[2]][ind[3]]['pass'] = row['pass'].values[0]
But in this case, 'age' and 'gender' are not nested under 'name'. I can't seem to wrap my head around how to do this...
Another option is to set a MultiIndex and make a .to_dict('index') call. But then again, I don't see how I can nest both dicts and non-dicts under a single key...
My question is similar to this one: Convert pandas DataFrame to a nested dict, but I'm looking for a more complex nesting (e.g., not just one last column which should be nested under all other columns). Most other questions on Stackoverflow ask for the reverse: creating a (possibly MultiIndex) DataFrame from a deeply nested dictionary.
Edit: The question is also similar to this q: Pandas convert Dataframe to Nested Json, but in that question, only the last column (e.g., column n) should be nested under all other columns (n-1, n-2 etc; fully recursive nesting). In my question, column n and n-1 should be nested under n-2, but column n-2 and n-3 should be nested under n-4 (thus, importantly, n-2 is not nested under n-3 but under n-4). The MultiIndex partial solution offered by Mohammad Yusuf Ghazi depicts the structure nicely.
This is partial answer. I don't know how to convert the index to json.
Output:
Not really concise, but it's the best I can get now:
The idea is to roll up data to dictionaries while grouping data to get 'study' column
update I've tried to create more generic solution, so it'd work for question like this one as well: