Pandas to_hdf succeeds but then read_hdf fails when I use custom objects as column headers (I use custom objects because I need to store other info in them).
Is there some way to make this work? Or is this just a Pandas bug or PyTables bug?
As an example, below I first make a DataFrame foo that uses string column headers, and everything works fine with to_hdf/read_hdf. After I change foo to use a custom Col class for column headers, to_hdf still works fine, but read_hdf raises an assertion error:
In [48]: foo = pd.DataFrame(np.random.randn(2, 3), columns = ['aaa', 'bbb', 'ccc'])
In [49]: foo
Out[49]:
        aaa       bbb       ccc
0 -0.434303  0.174689  1.373971
1 -0.562228  0.862092 -1.361979
In [50]: foo.to_hdf('foo.h5', 'foo')
In [51]: bar = pd.read_hdf('foo.h5', 'foo')
In [52]: bar
Out[52]:
        aaa       bbb       ccc
0 -0.434303  0.174689  1.373971
1 -0.562228  0.862092 -1.361979
In [52]:
In [53]: class Col(object):
    ...:     def __init__(self, name, other_info):
    ...:         self.name = name
    ...:         self.other_info = other_info
    ...:     def __str__(self):
    ...:         return self.name
    ...:
In [54]: foo = pd.DataFrame(np.random.randn(2, 3), columns = [Col('aaa', {'z': 5}), Col('bbb', {'y': True}), Col('ccc', {})])
In [55]: foo
Out[55]:
        aaa       bbb       ccc
0 -0.830503  1.066178  1.057349
1  0.406967 -0.131430  1.970204
In [56]: foo.to_hdf('foo.h5', 'foo')
In [57]: bar = pd.read_hdf('foo.h5', 'foo')
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-57-888b061a1d2c> in <module>()
----> 1 bar = pd.read_hdf('foo.h5', 'foo')
/.../python3.4/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
330
331 try:
--> 332 return store.select(key, auto_close=auto_close, **kwargs)
333 except:
334 # if there is an error, close the store
/.../python3.4/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
672 auto_close=auto_close)
673
--> 674 return it.get_result()
675
676 def select_as_coordinates(
/.../python3.4/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1366
1367 # directly return the result
-> 1368 results = self.func(self.start, self.stop, where)
1369 self.close()
1370 return results
/.../python3.4/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
665 return s.read(start=_start, stop=_stop,
666 where=_where,
--> 667 columns=columns, **kwargs)
668
669 # create the iterator
/.../python3.4/site-packages/pandas/io/pytables.py in read(self, **kwargs)
2792 blocks.append(blk)
2793
-> 2794 return self.obj_type(BlockManager(blocks, axes))
2795
2796 def write(self, obj, **kwargs):
/.../python3.4/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
2180 self._consolidate_check()
2181
-> 2182 self._rebuild_blknos_and_blklocs()
2183
2184 def make_empty(self, axes=None):
/.../python3.4/site-packages/pandas/core/internals.py in _rebuild_blknos_and_blklocs(self)
2271
2272 if (new_blknos == -1).any():
-> 2273 raise AssertionError("Gaps in blk ref_locs")
2274
2275 self._blknos = new_blknos
AssertionError: Gaps in blk ref_locs
UPDATE:
So Jeff answered (a) "this is not supported" and (b) "if you have meta-data then write it to the attributes".
Question 1 regarding (a): My column header objects have methods to return their properties, etc. For example, instead of a column header string 'x5y3z8' where I would have to parse out the values, I can simply do col_header.x (gives 5), col_header.y (gives 3), etc. This is object-oriented and pythonic, compared with storing info in a string and parsing it every time I need it. How do you suggest I replace my current column header objects in a nice way that is also supported?
(BTW, you might look at 'x5y3z8' and think hierarchical index works, but that is not the case because not every column header is 'x#y#z#'. I might have one column 'foo' of strings, another one 'bar5baz7' of ints, and another 'x5y3z8' of floats. The column headers aren't uniform.)
Question 2 regarding (a): When you say it's not supported, are you specifically talking about to_hdf/read_hdf not supporting it, or are you actually saying that Pandas in general doesn't support it? If it's only the HDF5 support that's missing, then I could switch to some other way of saving the DataFrames to disk and have it work, right? Do you foresee any problems with that in the future? Will this ever break with to_pickle/read_pickle, for example? (I lose performance, but got to give up something, right?)
Question 3 regarding (b): What do you mean by "if you have meta-data then write it to the attributes"? Attributes of what? A simple example would help me a lot. I'm pretty new to Pandas. Thanks!
This is not a supported feature.
This will raise in the next version of pandas (on writing) for format='table'. It should raise for fixed as well, but that's not implemented. This is simply not supported, nor is it likely to be. You should just use strings. If you have meta-data, then write it to the attributes.
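
As a rough illustration of "use strings and write the meta-data to the attributes", here is a minimal sketch. It assumes PyTables is installed and uses HDFStore.get_storer(key).attrs, which exposes the attributes of the stored node. The helper names save_with_col_info and load_with_col_info and the attribute name col_info are made up for this example and are not part of pandas. The column labels are plain strings, and the Col objects are rebuilt from the stored dict after reading.

```python
import numpy as np
import pandas as pd


class Col(object):
    """Column descriptor from the question: a name plus extra info."""

    def __init__(self, name, other_info):
        self.name = name
        self.other_info = other_info

    def __str__(self):
        return self.name


def save_with_col_info(df, cols, path, key):
    """Write df (string column labels) and attach the per-column info.

    Hypothetical helper: `col_info` is an arbitrary attribute name
    chosen for this sketch.
    """
    with pd.HDFStore(path, mode="w") as store:
        store.put(key, df)
        # Store a plain dict {column name: other_info} on the node.
        store.get_storer(key).attrs.col_info = {
            c.name: c.other_info for c in cols
        }


def load_with_col_info(path, key):
    """Read df back and rebuild a {column name: Col} lookup."""
    with pd.HDFStore(path, mode="r") as store:
        df = store[key]
        info = store.get_storer(key).attrs.col_info
    cols = {name: Col(name, other) for name, other in info.items()}
    return df, cols
```

With this, foo would be built with columns=[str(c) for c in cols], saved with save_with_col_info(foo, cols, 'foo.h5', 'foo'), and read back with bar, bar_cols = load_with_col_info('foo.h5', 'foo'). The extra info for a column is then available as bar_cols['aaa'].other_info instead of living in the column header itself.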