Transforming multiindex to row-wise multi-dimensio

Posted 2020-08-22 04:57

Question:

Suppose I have a MultiIndex DataFrame similar to an example from the MultiIndex docs.

>>> df 
               0   1   2   3
first second                
bar   one      0   1   2   3
      two      4   5   6   7
baz   one      8   9  10  11
      two     12  13  14  15
foo   one     16  17  18  19
      two     20  21  22  23
qux   one     24  25  26  27
      two     28  29  30  31

I want to generate a NumPy array from this DataFrame with a 3-dimensional structure like

>>> desired_arr
array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]],

       [[16, 20],
        [17, 21],
        [18, 22],
        [19, 23]],

       [[24, 28],
        [25, 29],
        [26, 30],
        [27, 31]]])

How can I do so?

Hopefully it is clear what is happening here: I am effectively unstacking the DataFrame by the second index level (level 1) and then trying to turn each row of the result into its own 2-dimensional array, with one row per top-level column label.

I can get halfway there with

>>> df.unstack(1)
         0       1       2       3    
second one two one two one two one two
first                                 
bar      0   4   1   5   2   6   3   7
baz      8  12   9  13  10  14  11  15
foo     16  20  17  21  18  22  19  23
qux     24  28  25  29  26  30  27  31

but then I am struggling to find a nice way to turn each row into a 2-dimensional array and then join them together, beyond doing so explicitly with loops and lists.

I feel like there should be some way to specify the shape of my desired NumPy array beforehand, fill it with np.nan, and then use a specific iteration order to fill in the values from my DataFrame, but I have not managed to solve the problem with this approach yet.


To generate the sample DataFrame:

import numpy as np
import pandas as pd

iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8*4).reshape((8, 4)), index=ind)

Answer 1:

Some reshape and swapaxes magic -

df.values.reshape(4,2,-1).swapaxes(1,2)

Generalizable to -

m,n = len(df.index.levels[0]), len(df.index.levels[1])
arr = df.values.reshape(m,n,-1).swapaxes(1,2)

This splits the first axis (the 8 rows) into two axes of lengths 4 and 2, creating a 3D array of shape (4, 2, 4), and then swaps the last two axes, i.e. pushes the axis of length 2 to the back (as the last one), for a final shape of (4, 4, 2).
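To make the two steps concrete, here is a small sketch that rebuilds the sample DataFrame from the question and prints the intermediate shapes. It assumes the index is the complete, sorted product of both levels (as in the sample), because the reshape relies purely on row order. `to_numpy()` is used here in place of `.values`, and the names `split` and `arr` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample DataFrame from the question.
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8 * 4).reshape((8, 4)), index=ind)

m, n = len(df.index.levels[0]), len(df.index.levels[1])

# Step 1: split the 8 rows into 4 groups of 2 -> axes (first, second, column).
split = df.to_numpy().reshape(m, n, -1)

# Step 2: swap the last two axes -> axes (first, column, second).
arr = split.swapaxes(1, 2)

print(split.shape, arr.shape)  # (4, 2, 4) (4, 4, 2)
```

Note that `swapaxes` returns a view where possible, so `arr` shares memory with the reshaped data; call `arr.copy()` if an independent array is needed.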

Sample output -

In [35]: df.values.reshape(4,2,-1).swapaxes(1,2)
Out[35]: 
array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]],

       [[16, 20],
        [17, 21],
        [18, 22],
        [19, 23]],

       [[24, 28],
        [25, 29],
        [26, 30],
        [27, 31]]])
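
If the index is not a complete, sorted product (for example after rows have been dropped or reordered), the plain reshape no longer lines up. The preallocate-and-fill idea mentioned in the question can be sketched using the integer codes of the MultiIndex. This is an alternative under stated assumptions, not part of the answer above: the `partial` frame is a made-up example with one row removed and the order reversed, and the result is a float array because NaN marks the missing slot.

```python
import numpy as np
import pandas as pd

# Sample DataFrame from the question.
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8 * 4).reshape((8, 4)), index=ind)

# Hypothetical messy input: row ('baz', 'two') is missing, order is reversed.
partial = df.iloc[[7, 6, 5, 4, 2, 1, 0]]

# Integer position of each row's label within its index level.
first_codes, second_codes = partial.index.codes
m, n = len(partial.index.levels[0]), len(partial.index.levels[1])

# Preallocate with axes (first, column, second), filled with NaN.
filled = np.full((m, partial.shape[1], n), np.nan)

# Indexing axes 0 and 2 with arrays and axis 1 with a slice selects one
# length-4 column vector per DataFrame row, so the target has shape (7, 4).
filled[first_codes, :, second_codes] = partial.to_numpy()
```

Here `filled[1, :, 1]`, the slot for the missing ('baz', 'two') row, stays NaN, while every other slot matches the reshape-based result.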