Propagate pandas series metadata through joins

Posted 2019-02-07 09:16

Question:

I'd like to be able to attach metadata to the series of dataframes (specifically, the original filename), so that after joining two dataframes I can see where each of the series came from.

I see github issues regarding _metadata (here, here), including some relating to the current _metadata attribute (here), but nothing in the pandas docs.

So far I can modify the _metadata attribute to supposedly allow preservation of metadata, but get an AttributeError after the join.

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df1._metadata.append('filename')
df1[df1.columns[0]]._metadata.append('filename')

for c in df1:
    df1[c].filename = 'fname1.csv'
    df2[c].filename = 'fname2.csv'

df1[0]._metadata  # ['name', 'filename']
df1[0].filename  # fname1.csv
df2[0].filename  # fname2.csv
df1[0][:3].filename  # fname1.csv

mgd = pd.merge(df1, df2, on=[0])
mgd['1_x']._metadata  # ['name', 'filename']
mgd['1_x'].filename  # raises AttributeError

Any way to preserve this?

Update: Epilogue

As discussed here, __finalize__ cannot keep track of Series that are members of a dataframe, only independent series. So for now I'll keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the dataframes. My code looks like:

def cust_merge(d1, d2):
    "Custom merge function for 2 dicts"
    ...

def finalize_df(self, other, method=None, **kwargs):
    for name in self._metadata:
        if method == 'merge':
            lmeta = getattr(other.left, name, {})
            rmeta = getattr(other.right, name, {})
            newmeta = cust_merge(lmeta, rmeta)
            object.__setattr__(self, name, newmeta)
        else:
            object.__setattr__(self, name, getattr(other, name, None))
    return self

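# Register the attribute name and the finalizer first, so that pandas treats
# 'filenames' as metadata rather than as a possible new column.
pd.DataFrame._metadata = ['filenames']
pd.DataFrame.__finalize__ = finalize_df
df1.filenames = {c: 'fname1.csv' for c in df1}
df2.filenames = {c: 'fname2.csv' for c in df2}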
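The body of cust_merge is not shown above. As a purely hypothetical sketch (the function name, its parameters and the choice to keep both sources for key columns are assumptions, not the author's code), a helper that combines two {column: filename} dicts could mirror how a merge renames overlapping non-key columns with the default '_x' / '_y' suffixes:

```python
def merge_column_metadata(left_meta, right_meta, on=(), suffixes=("_x", "_y")):
    """Combine two {column: metadata} dicts (hypothetical helper).

    Overlapping non-key columns are renamed with the given suffixes,
    the same way the merged frame in the question ends up with '1_x'
    and '1_y'. Key columns appear once and keep both sources.
    """
    keys = set(on)
    overlap = (set(left_meta) & set(right_meta)) - keys
    merged = {}
    # Key columns: remember where each side came from as a (left, right) pair.
    for key in keys:
        merged[key] = (left_meta.get(key), right_meta.get(key))
    # Remaining columns: add a suffix only when the name exists on both sides.
    for meta, suffix in ((left_meta, suffixes[0]), (right_meta, suffixes[1])):
        for col, value in meta.items():
            if col in keys:
                continue
            new_name = f"{col}{suffix}" if col in overlap else col
            merged[new_name] = value
    return merged


left = {c: "fname1.csv" for c in range(3)}
right = {c: "fname2.csv" for c in range(3)}
merged = merge_column_metadata(left, right, on=[0])
```

With the integer column labels from the question, the result has one entry for the key column 0 and one entry each for '1_x', '2_x', '1_y' and '2_y'. A merge called with custom suffixes or with left_on / right_on would need matching arguments here.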

Answer 1:

I think something like this will work. If it doesn't, please file a bug report: this is supported but a bit bleeding edge, i.e. it IS possible that the join methods don't call this all the time. That part is somewhat untested.

See this issue for a more detailed example/bug fix.

from pandas import DataFrame

DataFrame._metadata = ['name', 'filename']


def __finalize__(self, other, method=None, **kwargs):
    """
    propagate metadata from other to self

    Parameters
    ----------
    other : the object from which to get the attributes that we are going
        to propagate
    method : optional, a passed method name ; possibly to take different
        types of propagation actions based on this

    """

    ### you need to arbitrate when there are conflicts

    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, None))
    return self

DataFrame.__finalize__ = __finalize__

So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code that arbitrates between conflicts. This is the reason it is not done by default: e.g. if frame1 has name 'foo' and frame2 has name 'bar', what do you do when the method is __add__? What about another method? Let us know what you do and how it works out.

This ONLY replaces the finalizer for DataFrame. You can simply keep the default action, which is to propagate other to self, or you can set nothing except under special cases of method.

This method is meant to be overridden in subclasses; that's why you are monkey patching here (rather than subclassing, which is overkill most of the time).
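For comparison, the subclassing route could look roughly like the sketch below. It assumes a hypothetical class name FileFrame and a single frame-level attribute, filename. It relies on the _metadata and _constructor hooks that pandas documents for subclasses. It only demonstrates propagation through column selection. What happens on merge depends on the __finalize__ logic and may differ between pandas versions.

```python
import pandas as pd


class FileFrame(pd.DataFrame):
    """DataFrame subclass carrying a frame-level 'filename' (hypothetical)."""

    # Names listed here are copied to results by the default __finalize__.
    _metadata = ["filename"]

    @property
    def _constructor(self):
        # Make pandas operations return FileFrame instead of DataFrame.
        return FileFrame


df = FileFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.filename = "fname1.csv"

# Selecting a list of columns yields a new FileFrame with the attribute kept.
subset = df[["a"]]
print(type(subset).__name__, subset.filename)
```

This keeps the patch local to one class instead of changing every DataFrame in the process, at the cost of having to construct FileFrame objects explicitly.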