I'd like to be able attach metadata to the series of dataframes (specifically, the original filename), so that after joining two dataframes I can see metadata on where each of the series came from.
I see github issues regarding _metadata
(here, here), including some relating to the current _metadata
attribute (here), but nothing in the pandas docs.
So far I can modify the _metadata
attribute to supposedly allow preservation of metadata, but get an AttributeError
after the join.
df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df1._metadata.append('filename')
df1[df1.columns[0]]._metadata.append('filename')
for c in df1:
df1[c].filename = 'fname1.csv'
df2[c].filename = 'fname2.csv'
df1[0]._metadata # ['name', 'filename']
df1[0].filename # fname1.csv
df2[0].filename # fname2.csv
df1[0][:3].filename # fname1.csv
mgd = pd.merge(df1, df2, on=[0])
mgd['1_x']._metadata # ['name', 'filename']
mgd['1_x'].filename # raises AttributeError
Any way to preserve this?
Update: Epilogue
As discussed here, __finalize__
cannot keep track of Series that are members of a dataframe, only independent series. So for now I'll keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the dataframes. My code looks like:
def cust_merge(d1, d2):
"Custom merge function for 2 dicts"
...
def finalize_df(self, other, method=None, **kwargs):
for name in self._metadata:
if method == 'merge':
lmeta = getattr(other.left, name, {})
rmeta = getattr(other.right, name, {})
newmeta = cust_merge(lmeta, rmeta)
object.__setattr__(self, name, newmeta)
else:
object.__setattr__(self, name, getattr(other, name, None))
return self
df1.filenames = {c: 'fname1.csv' for c in df1}
df2.filenames = {c: 'fname2.csv' for c in df2}
pd.DataFrame._metadata = ['filenames']
pd.DataFrame.__finalize__ = finalize_df
I think something like this will work (and if not, pls file a bug report as this, while supported is a bit bleading edge, iow it IS possible that the join methods don't call this all the time. That is a bit untested).
See this issue for a more detailed example/bug fix.
So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code which can arbitrate between conflicts. This is the reason this is not done by default, e.g. frame1 has name 'foo' and frame2 has name 'bar', what do you do when the method is
__add__
, what about another method?. Let us know what you do and how it works out.This is ONLY replacing for DataFrame (and you can simply do the default action if you want), which is to propogate other to self; you can also not set anything except under special cases of method.
This method is meant to be overriden if sub-classes, that's why you are monkey patching here (rather than sub-classing which is most of the time overkill).