Subclassing pandas classes seems to be a common need, but I could not find references on the subject. (It seems that pandas developers are still working on it: https://github.com/pydata/pandas/issues/60).
There are some SO threads on the subject, but I am hoping that someone here can give a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two requirements that I think are fairly general:
import numpy as np
import pandas as pd
class MyDF(pd.DataFrame):
    # how to subclass pandas DataFrame?
    pass
mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf)) # <class '__main__.MyDF'>
# Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
# should produce instances of MyDF.
mydf_sub = mydf[['A','C']]
print(type(mydf_sub)) # <class 'pandas.core.frame.DataFrame'>
# Requirement 2: Attributes attached to instances of MyDF, when calling standard
# methods of DataFrame, should still attach to the output.
mydf.myattr = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(hasattr(mydf_cp1, 'myattr')) # False
print(hasattr(mydf_cp2, 'myattr')) # False
Also, are there any significant differences when subclassing pandas.Series? Thank you.
There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.
The guide is available here: http://pandas.pydata.org/pandas-docs/stable/internals.html#subclassing-pandas-data-structures
The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py
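For the Series part of the question, the pattern in that guide pairs a Series subclass with a DataFrame subclass so that each can produce the other. The following is a minimal sketch of that idea, not the guide's exact code: the class names MySeries and MyFrame are hypothetical, and it assumes a pandas version that supports the _constructor_sliced and _constructor_expanddim hooks.

```python
import pandas as pd


class MySeries(pd.Series):
    @property
    def _constructor(self):
        # Same-dimension results (e.g. head, slicing) come back as MySeries.
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Results that gain a dimension (e.g. to_frame) come back as MyFrame.
        return MyFrame


class MyFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Same-dimension results come back as MyFrame.
        return MyFrame

    @property
    def _constructor_sliced(self):
        # Results that lose a dimension (e.g. selecting one column)
        # come back as MySeries.
        return MySeries


df = MyFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
col = df["A"]            # single column
head = col.head(2)       # Series method on the subclass
back = col.to_frame()    # Series expanded to a frame again

print(type(col).__name__, type(head).__name__, type(back).__name__)
```

So the main difference for Series is which extra hook points at the companion class: _constructor_expanddim on the Series side, _constructor_sliced on the DataFrame side.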
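As in HYRY's answer, it seems there are two things you're trying to accomplish:
1. Get instances of your own type back from DataFrame methods. For this, add a _constructor property which should return your type.
2. Have your custom attributes carried over to those results. For this, list the attribute names in the _metadata attribute.
Here's an example: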
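A minimal sketch of that approach, assuming a reasonably recent pandas version. The names MyDF and myattr are taken from the question; everything else is illustrative rather than the answer's original code.

```python
import numpy as np
import pandas as pd


class MyDF(pd.DataFrame):
    # Attribute names listed here are copied onto results by pandas'
    # __finalize__ hook, which most DataFrame methods call.
    _metadata = ["myattr"]

    @property
    def _constructor(self):
        # Tells pandas which class to use when a method builds a new frame.
        return MyDF


mydf = MyDF(np.random.randn(3, 4), columns=["A", "B", "C", "D"])
mydf.myattr = 1

mydf_sub = mydf[["A", "C"]]   # Requirement 1: result should be a MyDF
mydf_cp = mydf.copy()         # Requirement 2: copy should keep myattr

print(type(mydf_sub).__name__)
print(getattr(mydf_cp, "myattr", None))
```

Note that this sketch relies on methods that go through __finalize__. Calling the constructor directly, as in MyDF(mydf), goes through __init__ instead, so that case likely needs separate handling.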
For Requirement 1, just define _constructor.
I think there is no simple solution for Requirement 2. I think you need to define __init__ and copy, or do something in _constructor, for example:
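One way this could look, as a sketch under stated assumptions rather than a definitive implementation: override __init__ and copy by hand so that myattr (the attribute name from the question) travels with the data. The handling below is hypothetical, covers only those two entry points, and assumes copy keeps its deep parameter.

```python
import numpy as np
import pandas as pd


class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        # Requirement 1: methods that build a new frame use MyDF.
        return MyDF

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Requirement 2, constructor path: when built from another MyDF,
        # carry its custom attribute over (hypothetical handling).
        data = args[0] if args else kwargs.get("data")
        if isinstance(data, MyDF) and hasattr(data, "myattr"):
            self.myattr = data.myattr

    def copy(self, deep=True):
        # Requirement 2, copy path: re-attach the attribute to the copy.
        result = super().copy(deep=deep)
        if hasattr(self, "myattr"):
            result.myattr = self.myattr
        return result


mydf = MyDF(np.random.randn(3, 4), columns=["A", "B", "C", "D"])
mydf.myattr = 1

mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(hasattr(mydf_cp1, "myattr"), hasattr(mydf_cp2, "myattr"))
```

Other methods (slicing, arithmetic, and so on) would not carry myattr with this approach alone, which is the sense in which there is no simple solution here.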