I'm trying to find a generic way of creating (possibly deeply) nested dictionaries from a flat Pandas DataFrame instance.
Suppose I have the following DataFrame:
dat = pd.DataFrame({'name' : ['John', 'John', 'John', 'John', 'Henry', 'Henry'],
'age' : [24, 24, 24, 24, 31, 31],
'gender' : ['Male','Male','Male','Male','Male','Male'],
'study' : ['Mathematics', 'Mathematics', 'Mathematics', 'Philosophy', 'Physics', 'Physics'],
'course' : ['Calculus 101', 'Calculus 101', 'Calculus 102', 'Aristotelean Ethics', 'Quantum mechanics', 'Quantum mechanics'],
'test' : ['Exam', 'Essay','Exam','Essay', 'Exam1','Exam2'],
'pass' : [True, True, True, True, True, True],
'grade' : ['A', 'A', 'B', 'A', 'C', 'C']})
dat = dat[['name', 'age', 'gender', 'study', 'course', 'test', 'grade', 'pass']] #re-order columns to better reflect data structure
I want to create a deeply nested dictionary (or list of nested dictionaries), that 'respects' the underlying structure of this data. That is, a grade is information about a test, which is part of a course, which is part of a study, that a person does. Also, age and gender are information about that same person.
An example desired output is this:
[{'John': {'age': 24,
'gender': 'Male',
'study': {'Mathematics': {'Calculus 101': {'Exam': {'grade': 'B',
'pass': True}}},
'Philosophy': {'Aristotelean Ethics': {'Essay': {'grade': 'A',
'pass': True}}}}}},
{'Henry': {'age': 31,
'gender': 'Male',
'study': {'Physics': {'Quantum mechanics': {'Exam1': {'Grade': 'C',
'Pass': True},
'Exam2': {'Grade': 'C',
'Pass': True}}}}}}]
(although there may be other, similar ways to structure such data).
I tried using groupby, which makes it easy, for example, to nest 'grade' and 'pass' under 'test', nest 'test' under 'course', nest 'course' under 'study', and 'study' under 'name'. But, then I don't see how to add 'gender' and 'age' under 'name' as well? Something like this is the best I came up with:
dic = {}
for ind, row in dat.groupby(['name', 'study', 'course', 'test'])['grade', 'pass']:
#this is ugly and not very generic, but just as an example
if not ind[0] in dic:
dic[ind[0]] = {}
if not ind[1] in dic[ind[0]]:
dic[ind[0]][ind[1]] = {}
if not ind[2] in dic[ind[0]][ind[1]]:
dic[ind[0]][ind[1]][ind[2]] = {}
if not ind[3] in dic[ind[0]][ind[1]][ind[2]]:
dic[ind[0]][ind[1]][ind[2]][ind[3]] = {}
dic[ind[0]][ind[1]][ind[2]][ind[3]]['grade'] = row['grade'].values[0]
dic[ind[0]][ind[1]][ind[2]][ind[3]]['pass'] = row['pass'].values[0]
But in this case, 'age' and 'gender' are not nested under 'name'. I can't seem to wrap my head around how to do this...
Another option is to set a MultiIndex and make a .to_dict('index') call. But then again, I don't see how I can nest both dicts and non-dicts under a single key...
My question is similar to this one: Convert pandas DataFrame to a nested dict, but I'm looking for a more complex nesting (e.g., not just one last column which should be nested under all other columns). Most other questions on Stackoverflow ask for the reverse: creating a (possibly MultiIndex) DataFrame from a deeply nested dictionary.
Edit: The question is also similar to this q: Pandas convert Dataframe to Nested Json, but in that question, only the last column (e.g., column n) should be nested under all other columns (n-1, n-2 etc; fully recursive nesting). In my question, column n and n-1 should be nested under n-2, but column n-2 and n-3 should be nested under n-4 (thus, importantly, n-2 is not nested under n-3 but under n-4). The MultiIndex partial solution offered by Mohammad Yusuf Ghazi depicts the structure nicely.