I have an Excel file with some (mostly) nicely grouped rows. I built a fake example below.
Is there a way to get read_excel in Pandas to produce a multiindex preserving this structure?
For this example the MultiIndex would have four levels (Family, Individual, Child (optional), investment). If the subtotal values were lost that would be fine as they can easily be recreated in Pandas.
No, pandas
can't read such a structure.
An alternative solution is to use pandas
to read your data, but transform this into an easily accessible dictionary, rather than keeping your data in a dataframe with MultiIndex
.
There are 2 sensible requirements to make your data more usable:
- Make your investment fund names unique. This is trivial.
- Convert your Excel grouping to an additional column which indicates the parent of the row.
In the below example, these 2 requirements are assumed.
Setup
from collections import defaultdict
from functools import reduce
import operator
import pandas as pd
df = pd.DataFrame({'name': ['Simpson Family', 'Marge Simpson', 'Maggies College Fund',
'MCF Investment 2', 'MS Investment 1', 'MS Investment 2', 'MS Investment 3',
'Homer Simpson', 'HS Investment 1', 'HS Investment 3', 'HS Investment 2',
'Griffin Family', 'Lois Griffin', 'LG Investment 2', 'LG Investment 3',
'Brian Giffin', 'BG Investment 3'],
'Value': [600, 450, 100, 100, 100, 200, 50, 150, 100, 50, 0, 200, 150, 100, 50, 50, 50],
'parent': ['Families', 'Simpson Family', 'Marge Simpson', 'Maggies College Fund',
'Marge Simpson', 'Marge Simpson', 'Marge Simpson', 'Simpson Family',
'Homer Simpson', 'Homer Simpson', 'Homer Simpson', 'Families',
'Griffin Family', 'Lois Griffin', 'Lois Griffin', 'Griffin Family',
'Brian Giffin']})
Value name parent
0 600 Simpson Family Families
1 450 Marge Simpson Simpson Family
2 100 Maggies College Fund Marge Simpson
3 100 MCF Investment 2 Maggies College Fund
4 100 MS Investment 1 Marge Simpson
5 200 MS Investment 2 Marge Simpson
6 50 MS Investment 3 Marge Simpson
7 150 Homer Simpson Simpson Family
8 100 HS Investment 1 Homer Simpson
9 50 HS Investment 3 Homer Simpson
10 0 HS Investment 2 Homer Simpson
11 200 Griffin Family Families
12 150 Lois Griffin Griffin Family
13 100 LG Investment 2 Lois Griffin
14 50 LG Investment 3 Lois Griffin
15 50 Brian Giffin Griffin Family
16 50 BG Investment 3 Brian Giffin
Step 1
Define a child -> parent dictionary and some utility functions:
child_parent_dict = df.set_index('name')['parent'].to_dict()
tree = lambda: defaultdict(tree)
d = tree()
def get_all_parents(child):
"""Get all parents from hierarchy structure"""
while child != 'Families':
child = child_parent_dict[child]
if child != 'Families':
yield child
def getFromDict(dataDict, mapList):
"""Iterate nested dictionary"""
return reduce(operator.getitem, mapList, dataDict)
def default_to_regular_dict(d):
"""Convert nested defaultdict to regular dict of dicts."""
if isinstance(d, defaultdict):
d = {k: default_to_regular_dict(v) for k, v in d.items()}
return d
Step 2
Apply this to your dataframe. Use it to create a nested dictionary structure which will be more efficient for repeated queries.
df['structure'] = df['name'].apply(lambda x: ['Families'] + list(get_all_parents(x))[::-1])
for idx, row in df.iterrows():
getFromDict(d, row['structure'])[row['name']]['Value'] = row['Value']
res = default_to_regular_dict(d)
Result
Dataframe
Value name parent \
0 600 Simpson Family Families
1 450 Marge Simpson Simpson Family
2 100 Maggies College Fund Marge Simpson
3 100 MCF Investment 2 Maggies College Fund
4 100 MS Investment 1 Marge Simpson
5 200 MS Investment 2 Marge Simpson
6 50 MS Investment 3 Marge Simpson
7 150 Homer Simpson Simpson Family
8 100 HS Investment 1 Homer Simpson
9 50 HS Investment 3 Homer Simpson
10 0 HS Investment 2 Homer Simpson
11 200 Griffin Family Families
12 150 Lois Griffin Griffin Family
13 100 LG Investment 2 Lois Griffin
14 50 LG Investment 3 Lois Griffin
15 50 Brian Giffin Griffin Family
16 50 BG Investment 3 Brian Giffin
structure
0 [Families]
1 [Families, Simpson Family]
2 [Families, Simpson Family, Marge Simpson]
3 [Families, Simpson Family, Marge Simpson, Magg...
4 [Families, Simpson Family, Marge Simpson]
5 [Families, Simpson Family, Marge Simpson]
6 [Families, Simpson Family, Marge Simpson]
7 [Families, Simpson Family]
8 [Families, Simpson Family, Homer Simpson]
9 [Families, Simpson Family, Homer Simpson]
10 [Families, Simpson Family, Homer Simpson]
11 [Families]
12 [Families, Griffin Family]
13 [Families, Griffin Family, Lois Griffin]
14 [Families, Griffin Family, Lois Griffin]
15 [Families, Griffin Family]
16 [Families, Griffin Family, Brian Giffin]
Dictionary
{'Families': {'Griffin Family': {'Brian Giffin': {'BG Investment 3': {'Value': 50},
'Value': 50},
'Lois Griffin': {'LG Investment 2': {'Value': 100}, 'LG Investment 3': {'Value': 50},
'Value': 150},
'Value': 200},
'Simpson Family': {'Homer Simpson': {'HS Investment 1': {'Value': 100}, 'HS Investment 2': {'Value': 0}, 'HS Investment 3': {'Value': 50},
'Value': 150},
'Marge Simpson': {'MS Investment 1': {'Value': 100}, 'MS Investment 2': {'Value': 200}, 'MS Investment 3': {'Value': 50},
'Maggies College Fund': {'MCF Investment 2': {'Value': 100},
'Value': 100},
'Value': 450},
'Value': 600}}}
I don't think it is possible to implement this using read_excel as-it.
What you can do is to add additional columns to your excel sheet based on the four hierarchy levels (Family, Individual, Child (optional), investment) and then use read_excel()
with index_col[0,1,2,3]
to generate the pandas dataframe.
See the index_col parameter of the read_excel function.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
index_col : int, list of ints, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.