Guided by this answer, I started to build a pipeline for processing the columns of a DataFrame based on their dtype. But after getting some unexpected output and doing some debugging, I ended up with the following test DataFrame and dtype-checking code:
```python
import numpy as np
import pandas as pd

# Creating test dataframe
test = pd.DataFrame({'bool' : [False, True],
                     'int'  : [-1, 2],
                     'float': [-2.5, 3.4],
                     'compl': np.array([1 - 1j, 5]),
                     'dt'   : [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   : [pd.Timestamp('2012-03-02') - pd.Timestamp('2016-10-20'),
                               pd.Timestamp('2010-07-12') - pd.Timestamp('2000-11-10')],
                     'prd'  : [pd.Period('2002-03', 'D'), pd.Period('2012-02-01', 'D')],
                     'intrv': pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  : ['s1', 's2'],
                     'cat'  : [1, -1],
                     'obj'  : [[1, 2, 3], [5435, 35, -52, 14]]})
test['cat'] = test['cat'].astype('category')
test
test.dtypes

# Testing types
types = list(test.columns)
df_types = pd.DataFrame(np.zeros((len(types), len(types)), dtype=bool),
                        index=['is_' + el for el in types],
                        columns=types)
for col in test.columns:
    df_types.at['is_bool',  col] = pd.api.types.is_bool_dtype(test[col])
    df_types.at['is_int',   col] = pd.api.types.is_integer_dtype(test[col])
    df_types.at['is_float', col] = pd.api.types.is_float_dtype(test[col])
    df_types.at['is_compl', col] = pd.api.types.is_complex_dtype(test[col])
    df_types.at['is_dt',    col] = pd.api.types.is_datetime64_dtype(test[col])
    df_types.at['is_td',    col] = pd.api.types.is_timedelta64_dtype(test[col])
    df_types.at['is_prd',   col] = pd.api.types.is_period_dtype(test[col])
    df_types.at['is_intrv', col] = pd.api.types.is_interval_dtype(test[col])
    df_types.at['is_str',   col] = pd.api.types.is_string_dtype(test[col])
    df_types.at['is_cat',   col] = pd.api.types.is_categorical_dtype(test[col])
    df_types.at['is_obj',   col] = pd.api.types.is_object_dtype(test[col])

# Styling func: green where the result matches the expected pattern
# (True on the diagonal, False off it), red otherwise
def coloring(df):
    clr_g = 'color : green'
    clr_r = 'color : red'
    mask = ~np.logical_xor(df.values, np.eye(df.shape[0], dtype=bool))
    # OUTPUT
    return pd.DataFrame(np.where(mask, clr_g, clr_r),
                        index=df.index,
                        columns=df.columns)

# OUTPUT colored
df_types.style.apply(coloring, axis=None)
```
`test.dtypes` shows:

```
bool                  bool
int                  int64
float              float64
compl           complex128
dt          datetime64[ns]
td         timedelta64[ns]
prd              period[D]
intrv    interval[float64]
str                 object
cat               category
obj                 object
```
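For reference, the two surprising cells can be reproduced in isolation. I'm deliberately not annotating the `is_string_dtype` results, since this behavior has changed across pandas versions (this question was written against 0.24.2):

```python
import pandas as pd

cat = pd.Series([1, -1]).astype('category')
obj = pd.Series(['s1', 's2'], dtype=object)

# CategoricalDtype reports numpy kind 'O', the same kind as object dtype
print(cat.dtype.kind)  # 'O'

# These are the surprising checks; on pandas 0.24.2 both print True
# (newer pandas versions have tightened is_string_dtype)
print(pd.api.types.is_string_dtype(cat))
print(pd.api.types.is_string_dtype(obj))
```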
Almost everything looks good, but this test code raises two questions:

- The strangest part is that `pd.api.types.is_string_dtype` fires on the `category` dtype. Why is that? Should it be treated as expected behavior?
- Why do `is_string_dtype` and `is_object_dtype` fire on each other? This is somewhat expected, since `.dtypes` reports both as `object`, but I'd appreciate a step-by-step clarification.

P.S.: Bonus question: am I right in thinking that pandas has internal tests that must pass when building a new release (like `df_types` from the test code, but recording errors instead of coloring them red)?
EDIT: pandas version 0.24.2
---
This comes down to `is_string_dtype` being a fairly loose check; the implementation even carries a TODO note about making it stricter, linking to Issue #15585. The reason the check is not strict is that there isn't a dedicated string dtype in `pandas`: strings are simply stored with `object` dtype, which can hold anything, so a stricter check would likely introduce a performance overhead.

To answer your questions:
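Before the individual answers, a quick illustration of that looseness (a made-up snippet, not code from pandas itself). Because `object` dtype holds arbitrary Python objects, a genuinely-string column is indistinguishable from any other `object` column by dtype alone:

```python
import numpy as np
import pandas as pd

# A column of real strings and a column of mixed junk
# share exactly the same dtype...
strings = pd.Series(['a', 'b'], dtype=object)
mixed = pd.Series(['text', 42, [1, 2]], dtype=object)
print(strings.dtype == mixed.dtype)   # True: both are object

# ...and that dtype's numpy kind is 'O', the same kind
# that CategoricalDtype reports
print(np.dtype(object).kind)          # 'O'
```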
1. This is a result of `CategoricalDtype.kind` being set to `'O'`, which is one of the loose checks `is_string_dtype` performs. Given the TODO note, this could change in a future release, so it's not behavior I'd rely on.

2. Since strings are stored with `object` dtype, it makes sense for `is_object_dtype` to fire on strings, and I'd consider that behavior reliable, as the implementation will almost certainly not change in the immediate future. The reverse, `is_string_dtype` firing on `object` columns, comes from the reliance on `dtype.kind` inside `is_string_dtype`, and carries the same caveats as for categoricals above.

3. Yes, `pandas` has a test suite that runs automatically on various CI services for every PR that's created. The test suite includes checks similar to what you're doing.

One tangentially related note: there is a library called `fletcher` that uses Apache Arrow to implement a more native string type in a way that's compatible with `pandas`. It's still under development and probably doesn't yet support all the string operations that `pandas` does.
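On the test-suite point: as an illustrative sketch only (not pandas' actual test code), the suite largely consists of plain pytest functions with bare `assert`s, along these lines:

```python
import pandas as pd
import pandas.api.types as ptypes

def test_integer_column_dtype_checks():
    # An int64 column should satisfy exactly the integer check
    # and none of the other numeric/bool checks
    s = pd.Series([1, 2, 3])
    assert ptypes.is_integer_dtype(s)
    assert not ptypes.is_float_dtype(s)
    assert not ptypes.is_bool_dtype(s)
    assert not ptypes.is_complex_dtype(s)

# pytest would collect this automatically; calling it directly
# works too, since it's just a function
test_integer_column_dtype_checks()
```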