High-dimensional data structure in Python

Posted 2019-05-06 13:00

Question:

What is the best way to store and analyze high-dimensional data in Python? I like the pandas DataFrame and Panel, where I can easily manipulate the axes. Now I have a hyper-cube (dim >= 4) of data. I have been thinking of things like a dict of Panels, or tuples as Panel entries. Is there a high-dimensional Panel equivalent in Python?

Update 20/05/16: Thanks very much for all the answers. I have tried MultiIndex and xarray, but I am not able to comment on either of them. For my problem I will use an ndarray instead, as I found the labels are not essential and I can store them separately.

Update 16/09/16: I ended up using MultiIndex. Manipulating it was pretty tricky at first, but I have gotten used to it now.

Answer 1:

MultiIndex is the most useful option for higher-dimensional data, as explained in the docs and this SO answer, because it allows you to work with any number of dimensions in a DataFrame environment.
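As a minimal sketch of that idea, a 4-D array can be flattened into a Series whose index has one level per dimension. The dimension names and labels below (year, region, item, metric) are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-D cube: 2 years x 2 regions x 3 items x 2 metrics.
cube = np.arange(24).reshape(2, 2, 3, 2)

# One index level per dimension. from_product enumerates the labels with
# the last level varying fastest, which matches the order of cube.ravel().
index = pd.MultiIndex.from_product(
    [[2015, 2016], ["north", "south"], ["x", "y", "z"], ["price", "volume"]],
    names=["year", "region", "item", "metric"],
)
s = pd.Series(cube.ravel(), index=index)

# Pick a single cell by its four labels.
cell = s.loc[(2016, "south", "z", "volume")]

# Move one dimension into the columns to get a familiar 2-D view.
wide = s.unstack("metric")
```

Here `cell` is the same value as `cube[1, 1, 2, 1]`, and `wide` has one row per (year, region, item) combination with `price` and `volume` as columns.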

In addition to Panel, there is also Panel4D, currently at an experimental stage. Given the advantages of MultiIndex, I wouldn't recommend using either it or the three-dimensional version. I don't think these data structures have gained much traction in comparison, and they will indeed be phased out.



Answer 2:

If you need labelled arrays and pandas-like smart indexing, you can use the xarray package, which is essentially an n-dimensional extension of the pandas Panel (Panels are being deprecated in pandas in favour of xarray).
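A rough sketch of what that looks like, assuming xarray is installed. The dimension names and coordinate labels are invented for the example.

```python
import numpy as np
import xarray as xr

# Hypothetical 4-D cube with a name and labels for every dimension.
data = np.arange(24).reshape(2, 2, 3, 2)
da = xr.DataArray(
    data,
    dims=["year", "region", "item", "metric"],
    coords={
        "year": [2015, 2016],
        "region": ["north", "south"],
        "item": ["x", "y", "z"],
        "metric": ["price", "volume"],
    },
)

# Label-based selection along any subset of dimensions;
# the remaining dimensions (region, item) are kept.
sub = da.sel(year=2016, metric="price")

# Reduce over named dimensions instead of axis numbers.
mean_by_region = da.mean(dim=["year", "item", "metric"])
```

Selection and reduction are both expressed by dimension name, so there is no need to remember which axis number holds which quantity.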

Otherwise, it may sometimes be reasonable to use plain NumPy arrays, which can be of any dimensionality; you can also have arbitrarily nested NumPy record arrays of any dimension.
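One way this could look, with the labels kept in a separate dict next to the array. The `select` helper and the label names are hypothetical, written only for this illustration.

```python
import numpy as np

# Labels live outside the array; dict order defines the axis order.
labels = {
    "year": [2015, 2016],
    "region": ["north", "south"],
    "item": ["x", "y", "z"],
    "metric": ["price", "volume"],
}
axes = list(labels)
cube = np.arange(24).reshape([len(v) for v in labels.values()])

def select(cube, **picks):
    """Index the cube by label; axes not mentioned are kept whole."""
    idx = tuple(
        labels[ax].index(picks[ax]) if ax in picks else slice(None)
        for ax in axes
    )
    return cube[idx]

# Fix two of the four dimensions; an (item x metric) block remains.
south_2016 = select(cube, year=2016, region="south")
```

This keeps everything in plain NumPy at the cost of maintaining the label lookup yourself.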



Answer 3:

I recommend continuing to use DataFrame but utilizing the MultiIndex feature. DataFrame is better supported, and you preserve all of your dimensionality with the MultiIndex.

Example

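import pandas as pd

# Base 2-D frame: rows A/B, columns a/b.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['A', 'B'])

# Third dimension: stack copies along the rows under an outer index level.
df3 = pd.concat([df, df], keys=['one', 'two'])

# Fourth dimension: place copies side by side under an outer column level.
df4 = pd.concat([df3, df3], axis=1, keys=['One', 'Two'])

print(df4)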

Looks like:

      One    Two   
        a  b   a  b
one A   1  2   1  2
    B   3  4   3  4
two A   1  2   1  2
    B   3  4   3  4

This is a hyper-cube of data. You'll also be much better served in terms of support and answered questions, with fewer bugs and many other benefits.
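For what it's worth, here is a short sketch of how slices of such a frame might be pulled out. It rebuilds the same four-level frame so that it runs on its own; the variable names `slab`, `cell` and `all_b` are just for illustration.

```python
import pandas as pd

# Rebuild the 4-D example: two row levels and two column levels.
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"], index=["A", "B"])
df3 = pd.concat([df, df], keys=["one", "two"])
df4 = pd.concat([df3, df3], axis=1, keys=["One", "Two"])

# Fix the outer row label: rows A/B remain, columns keep both levels.
slab = df4.loc["one"]

# Fix every level on both axes to reach a single cell.
cell = df4.loc[("two", "B"), ("One", "a")]

# Cross-section on the inner column level, across all outer labels.
all_b = df4.xs("b", axis=1, level=1)
```

`xs` drops the level it selects on by default, so `all_b` ends up with plain `One`/`Two` columns.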