normalizing data by duplication

2019-01-18 00:17发布

问题:

note: this question is indeed a duplicate of Split pandas dataframe string entry to separate rows, but the answer provided here is more generic and informative, so with all respect due, I chose not to delete the thread


I have a 'dataset' with the following format:

     id | value | ...
--------|-------|------
      a | 156   | ...
    b,c | 457   | ...
e,g,f,h | 346   | ...
    ... | ...   | ...

and I would like to normalize it by duplicating all values for each ids:

     id | value | ...
--------|-------|------
      a | 156   | ...
      b | 457   | ...
      c | 457   | ...
      e | 346   | ...
      g | 346   | ...
      f | 346   | ...
      h | 346   | ...
    ... | ...   | ...

What I'm doing is applying the split-apply-combine principle of pandas using .groupby that creates a tuple for each group (groupby value, pd.DataFrame())

I created a column to group by that simply counts the ids in the row:

df['count_ids'] = df['id'].str.split(',').apply(lambda x: len(x))

     id | value | count_ids
--------|-------|------
      a | 156   | 1
    b,c | 457   | 2
e,g,f,h | 346   | 4
    ... | ...   | ...

The way I'm duplicating the rows is as follows:

pd.DataFrame().append([group]*count_ids)

I'm slowly progressing, but it is really complex, and I would appreciate any best practice or recommendation you can share with this type of problems.

回答1:

try this:

In [44]: df
Out[44]:
        id  value
0        a    156
1      b,c    457
2  e,g,f,h    346

In [45]: (df['id'].str.split(',', expand=True)
   ....:          .stack()
   ....:          .reset_index(level=0)
   ....:          .set_index('level_0')
   ....:          .rename(columns={0:'id'})
   ....:          .join(df.drop('id',1), how='left')
   ....: )
Out[45]:
  id  value
0  a    156
1  b    457
1  c    457
2  e    346
2  g    346
2  f    346
2  h    346

Explanation:

In [48]: df['id'].str.split(',', expand=True).stack()
Out[48]:
0  0    a
1  0    b
   1    c
2  0    e
   1    g
   2    f
   3    h
dtype: object

In [49]: df['id'].str.split(',', expand=True).stack().reset_index(level=0)
Out[49]:
   level_0  0
0        0  a
0        1  b
1        1  c
0        2  e
1        2  g
2        2  f
3        2  h

In [50]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0')
Out[50]:
         0
level_0
0        a
1        b
1        c
2        e
2        g
2        f
2        h

In [51]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0').rename(columns={0:'id'})
Out[51]:
        id
level_0
0        a
1        b
1        c
2        e
2        g
2        f
2        h

In [52]: df.drop('id',1)
Out[52]:
   value
0    156
1    457
2    346