Split a Pandas Series without a MultiIndex

Posted 2019-02-15 13:40

Question:

I would like to take a Pandas Series with a single-level index and split on that index into a dataframe with multiple columns. For instance, for input:

s = pd.Series(range(10,17), index=['a','a','b','b','c','c','c'])

s
a    10
a    11
b    12
b    13
c    14
c    15
c    16
dtype: int64

What I would like as an output is:

    a    b    c
0   10   12   14
1   11   13   15
2   NaN  NaN  16

I cannot use unstack directly because it requires a MultiIndex and I only have a single-level index. I tried adding a dummy index level in which every entry had the same value, but I got the error "ReshapeError: Index contains duplicate entries, cannot reshape".
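For anyone who wants to reproduce the failure, here is a minimal sketch. The name s_dummy and the way the dummy level is built are my own assumptions about what was tried, and in recent pandas versions the error surfaces as a ValueError rather than a ReshapeError:

```python
import pandas as pd

s = pd.Series(range(10, 17), index=['a', 'a', 'b', 'b', 'c', 'c', 'c'])

# Assumed reconstruction of the attempt: add a constant dummy level.
# Every (dummy, label) pair is still duplicated, e.g. (0, 'a') appears twice.
dummy = pd.MultiIndex.from_arrays([[0] * len(s), s.index])
s_dummy = pd.Series(s.values, index=dummy)

try:
    s_dummy.unstack()
    failed = False
except ValueError as exc:
    # Recent pandas raises ValueError here; older releases used ReshapeError.
    failed = True
    print(exc)
```

The point is that unstack needs each combination of index levels to be unique, and a constant dummy level adds nothing that tells the two 'a' rows apart.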

I know this is a little unusual because 1) pandas doesn't like ragged arrays, so there will need to be padding, 2) the index needs to be arbitrarily reset, and 3) I can't really "initialize" the DataFrame until I know how long the longest column is going to be. Still, this seems like something I should be able to do somehow. I also thought about doing it via groupby, but there doesn't seem to be anything like grouped_df.values() that works without some kind of aggregating function, probably for the reasons above.

Answer 1:

You can use groupby, apply, and reset_index to create a MultiIndex Series, and then call unstack:

import pandas as pd
s = pd.Series(range(10,17), index=['a','a','b','b','c','c','c'])
df = s.groupby(level=0).apply(pd.Series.reset_index, drop=True).unstack(0)
print(df)

output:

   a   b   c
0  10  12  14
1  11  13  15
2 NaN NaN  16
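To make the intermediate step visible, here is a sketch of the same idea written out by hand. It swaps the apply call for groupby(...).cumcount(), so it is a variation and not the exact code above; the names pos, stacked and wide are only illustrative:

```python
import pandas as pd

s = pd.Series(range(10, 17), index=['a', 'a', 'b', 'b', 'c', 'c', 'c'])

# Position of each value within its label group: 0, 1, 0, 1, 0, 1, 2
pos = s.groupby(level=0).cumcount()

# Two-level index (position, label); every pair is now unique.
stacked = pd.Series(
    s.values,
    index=pd.MultiIndex.from_arrays([pos.values, s.index]),
)

# Move the label level (the last one) into the columns.
# Cells with no value are padded with NaN, so the result is float.
wide = stacked.unstack()
print(wide)
```

The per-group counter is what makes the index unique, which is exactly what the dummy level in the question was missing.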


Answer 2:

Not sure how generalizable this is. I call this the groupby-via-concat pattern. It is essentially an apply, but with control over exactly how the results are combined.

In [24]: s = pd.Series(range(10,17), index=['a','a','b','b','c','c','c'])

In [25]: df = pd.DataFrame(dict(key=s.index, value=s.values))

In [26]: df
Out[26]: 
  key  value
0   a     10
1   a     11
2   b     12
3   b     13
4   c     14
5   c     15
6   c     16

In [27]: pd.concat({g: pd.Series(grp['value'].values) for g, grp in df.groupby('key')}, axis=1)
Out[27]: 
    a   b   c
0  10  12  14
1  11  13  15
2 NaN NaN  16
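If you want to skip the intermediate key/value DataFrame, the same pattern can probably be applied to the Series directly. This is only a sketch: split_by_index is a hypothetical helper name, not a pandas function, and it assumes the labels to split on are in the first index level:

```python
import pandas as pd


def split_by_index(s):
    """Hypothetical helper: one column per index label, padded with NaN."""
    # Group on the index itself and renumber each group from 0,
    # so that concat can align the groups row by row.
    columns = {
        label: group.reset_index(drop=True)
        for label, group in s.groupby(level=0)
    }
    # axis=1 places the groups side by side; the dict keys become column names.
    return pd.concat(columns, axis=1)


s = pd.Series(range(10, 17), index=['a', 'a', 'b', 'b', 'c', 'c', 'c'])
result = split_by_index(s)
print(result)
```

Because concat aligns on the renumbered index, shorter groups are padded with NaN without having to know the longest column length in advance.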