What is the best way to store and analyze high-dimensional data in Python? I like the pandas DataFrame and Panel, where I can easily manipulate the axes. Now I have a hyper-cube (dim >= 4) of data. I have been thinking of things like a dict of Panels, or tuples as Panel entries. Is there a higher-dimensional Panel-like structure in Python?
update 20/05/16:
Thanks very much for all the answers. I have tried MultiIndex and xarray, but I am not able to comment on any of the answers. For my problem I will try a plain ndarray instead, as I found the labels are not essential and I can save them separately.
update 16/09/16:
I ended up using MultiIndex. The ways to manipulate it are pretty tricky at first, but I have gotten used to it now.
MultiIndex is most useful for higher-dimensional data, as explained in the docs and this SO answer, because it allows you to work with any number of dimensions in a DataFrame environment.
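As a rough sketch of what that looks like for a 4-dimensional cube, the snippet below flattens a NumPy array into a Series whose MultiIndex has one level per dimension. The dimension names (year, city, sensor, hour) and the values are made up for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-D cube: 2 years x 3 cities x 2 sensors x 4 hours
cube = np.arange(2 * 3 * 2 * 4, dtype=float).reshape(2, 3, 2, 4)

# One index level per dimension; from_product varies the last level fastest,
# which matches the row-major order of cube.ravel()
index = pd.MultiIndex.from_product(
    [[2015, 2016], ["NYC", "LA", "SF"], ["s1", "s2"], range(4)],
    names=["year", "city", "sensor", "hour"],
)
s = pd.Series(cube.ravel(), index=index, name="value")

# Label-based lookup of a single cell
cell = s.loc[(2016, "LA", "s2", 3)]

# Reduce over one dimension by grouping on the remaining levels
mean_over_hours = s.groupby(level=["year", "city", "sensor"]).mean()

# Move one dimension into the columns to get a 2-D view
table = s.unstack("hour")
```

Keeping everything in one Series means any level can be moved to the columns with unstack when a 2-D view is more convenient.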
In addition to the Panel, there is also Panel4D, currently in an experimental stage. Given the advantages of MultiIndex, I wouldn't recommend using either this or the three-dimensional version. I don't think these data structures have gained much traction in comparison, and they will indeed be phased out.
If you need labelled arrays and pandas-like smart indexing, you can use the xarray package, which is essentially an n-dimensional extension of the pandas Panel (Panels are being deprecated in pandas in favour of xarray).
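A minimal sketch of the xarray approach, assuming xarray is installed and using made-up dimension names and coordinates (none of these labels come from the question):

```python
import numpy as np
import xarray as xr

# Hypothetical 4-D cube: 2 years x 3 cities x 2 sensors x 4 hours
data = np.arange(2 * 3 * 2 * 4, dtype=float).reshape(2, 3, 2, 4)

da = xr.DataArray(
    data,
    dims=["year", "city", "sensor", "hour"],
    coords={
        "year": [2015, 2016],
        "city": ["NYC", "LA", "SF"],
        "sensor": ["s1", "s2"],
        "hour": [0, 1, 2, 3],
    },
)

# Select by label along any subset of dimensions; the result keeps the rest
sub = da.sel(year=2016, city="LA")   # 2-D: (sensor, hour)

# Reduce along a named dimension instead of a positional axis
hourly_mean = da.mean(dim="hour")    # 3-D: (year, city, sensor)
```

Here every axis carries a name and labels, so selections and reductions refer to "city" or "hour" rather than to axis positions.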
Otherwise, it may sometimes be reasonable to use plain numpy arrays which can be of any dimensionality; you can also have arbitrarily nested numpy record arrays of any dimension.
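For the plain NumPy route, one possible arrangement is to keep the labels in a separate dict next to the array and translate labels to positions by hand. The labels dict and the dimension names below are assumptions for illustration only:

```python
import numpy as np

# Hypothetical 4-D cube: 2 years x 3 cities x 2 sensors x 4 hours
cube = np.arange(2 * 3 * 2 * 4, dtype=float).reshape(2, 3, 2, 4)

# Labels stored separately; dict order defines the axis order
labels = {
    "year": [2015, 2016],
    "city": ["NYC", "LA", "SF"],
    "sensor": ["s1", "s2"],
    "hour": [0, 1, 2, 3],
}
axes = list(labels)

# Translate a label into an integer position, then slice along that axis
city_axis = axes.index("city")
la_pos = labels["city"].index("LA")
la = np.take(cube, la_pos, axis=city_axis)            # shape (2, 2, 4)

# Reductions work along any axis
mean_over_hours = cube.mean(axis=axes.index("hour"))  # shape (2, 3, 2)
```

This is fast and simple, but the bookkeeping between labels and axes is entirely manual.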
I recommend continuing to use DataFrame but utilizing the MultiIndex feature. DataFrame is better supported, and you preserve all of your dimensionality with the MultiIndex.
Example
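import pandas as pd
# 2-D base table
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['A', 'B'])
# stack two copies along the rows under a new outer row level (3rd dimension)
df3 = pd.concat([df for _ in [0, 1]], keys=['one', 'two'])
# put two copies side by side under a new outer column level (4th dimension)
df4 = pd.concat([df3 for _ in [0, 1]], axis=1, keys=['One', 'Two'])
print(df4)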
Looks like:
      One    Two
        a  b   a  b
one A   1  2   1  2
    B   3  4   3  4
two A   1  2   1  2
    B   3  4   3  4
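This is a hyper-cube of data: two row levels and two column levels give four dimensions. You'll also be much better served by staying with DataFrame, with better support, more answered questions, fewer bugs, and many other benefits.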
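To give an idea of how such a frame can be sliced, here is a short sketch that rebuilds the same frame and pulls out sub-blocks with xs and loc. The variable names block, a_rows and cell are only illustrative.

```python
import pandas as pd

# Rebuild the 4-D frame from the example
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"], index=["A", "B"])
df3 = pd.concat([df, df], keys=["one", "two"])
df4 = pd.concat([df3, df3], axis=1, keys=["One", "Two"])

# Fix the outer row level and the outer column level: a plain 2x2 block
block = df4.xs("one", level=0).xs("One", axis=1, level=0)

# Cross-section on the inner row level: all rows labelled "A"
a_rows = df4.xs("A", level=1)

# A single cell is addressed by one tuple per axis
cell = df4.loc[("two", "B"), ("Two", "a")]
```

Passing full tuples to loc avoids ambiguity about which labels belong to the rows and which to the columns.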