Add a sequence number to each element in a group u

2020-02-08 16:22发布

I have a dataframe of individuals who each have multiple records. I want to enumerate the record in the sequence for each individual in python. Essentially I would like to create the 'sequence' column in the following table:

patient  date      sequence
145      20Jun2009        1
145      24Jun2009        2
145      15Jul2009        3
582      09Feb2008        1
582      21Feb2008        2
987      14Mar2010        1
987      02May2010        2
987      12May2010        3

This is essentially the same question as here, but I am working in python and unable to implement the sql solution. I suspect I can use a groupby statement with an iterable count, but have so far been unsuccessful. Thanks!

3条回答
神经病院院长
2楼-- · 2020-02-08 16:41

The question is how do I sort on multiple columns of data.

One simple trick is to use the key parameter to the sorted function.

You'll be sorting by a string built from the columns of the array.

rows = ...# your source data

def date_to_sortable_string(date):
  # use datetime package to convert string to sortable date.
  pass

# Assume x[0] === patient_id and x[1] === encounter date

# Sort by patient_id and date
rows_sorted = sorted(rows, key=lambda x: "%0.5d-%s" % (x[0], date_to_sortable_string(x[1])))

for row in rows_sorted:
  print row
查看更多
唯我独甜
3楼-- · 2020-02-08 16:41

Firstly you want to convert the date column to be a pandas datetime (rather than strings):

In [11]: pd.to_datetime(df['date'], format='%d%b%Y')
Out[11]:
0   2009-06-20
1   2009-06-24
2   2009-07-15
3   2008-02-09
4   2008-02-21
5   2010-03-14
6   2010-05-02
7   2010-05-12
Name: date, dtype: datetime64[ns]

Note: see docs for possible format options.

In [12]: df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')

In [13]: df
Out[13]:
   patient       date  sequence
0      145 2009-06-20         1
1      145 2009-06-24         2
2      145 2009-07-15         3
3      582 2008-02-09         1
4      582 2008-02-21         2
5      987 2010-03-14         1
6      987 2010-05-02         2
7      987 2010-05-12         3

If this isn't in date order (for each patient), I would sort it first:

In [14]: df = df.sort('date')

Now you can groupby and cumcount:

In [15]: g = df.groupby('patient')

In [16]: g.cumcount() + 1
Out[16]:
2    1
3    2
0    1
1    2
4    1
5    2
6    3
dtype: int64

Which is what you want (althout it's out of order):

In [17]: df['sequence'] = g.cumcount() + 1

In [18]: df
Out[18]:
       patient       date  sequence
2      582 2008-02-09         1
3      582 2008-02-21         2
0      145 2009-06-24         1
1      145 2009-07-15         2
4      987 2010-03-14         1
5      987 2010-05-02         2
6      987 2010-05-12         3

To rearrange (though you may not need to) use sort_index (or we could reindex if we saved the initial DataFrame's index):*

In [19]: df.sort_index()
Out[19]:
   patient       date  sequence
0      145 2009-06-24         1
1      145 2009-07-15         2
2      582 2008-02-09         1
3      582 2008-02-21         2
4      987 2010-03-14         1
5      987 2010-05-02         2
6      987 2010-05-12         3
查看更多
倾城 Initia
4楼-- · 2020-02-08 16:48

I stumbled upon the answer which was embarrassingly simple. The groupby statement has a 'cumcount()' option which will enumerate group items.

df['sequence']=df.groupby('patient').cumcount()

The caveat is that the records have to be in the order you want them enumerated.

查看更多
登录 后发表回答