Filtering a DataFrame based on properties of its groups

Posted 2019-07-23 06:28

Question:

Let's say we have issue tracker logs and we want to find the owner of each issue (the user who logged the most time on it).

  1. A user can log time to the same issue multiple times
  2. If 2 users logged the same total time, they are both owners

So we have some sample data:

import pandas as pd

df = pd.DataFrame([
        [1, 10, 'John'],
        [1, 20, 'John'],
        [1, 30, 'Tom'],
        [1, 10, 'Bob'],
        [2, 25, 'John'],
        [2, 15, 'Bob']], columns = ['IssueKey','TimeSpent','User'])

As the output we want something like this:

issues_owners = pd.DataFrame([
        [1, 30, 'John'],
        [1, 30, 'Tom'],
        [2, 25, 'John']], columns = ['IssueKey','TimeSpent','User'])

  1. Both John and Tom are owners of issue 1, as they each spent 30 minutes on it
  2. John actually worked on issue 1 on 2 separate days (10 + 20 minutes)
  3. John is also the owner of issue 2
  4. Bob is lazy and doesn't own any issues :)

What I came up with feels quite disgusting (I'm relatively new to Python):

# Total time per (issue, user)
df = df.groupby(['IssueKey', 'User']).sum().reset_index()
# Maximum total per issue, turned into an IssueKey -> max lookup dict
maxTimesPerIssue = df.groupby('IssueKey')['TimeSpent'].max().reset_index()
maxTimesPerIssue = dict(zip(maxTimesPerIssue['IssueKey'], maxTimesPerIssue['TimeSpent']))
# Attach the per-issue maximum to every row, keep the rows that match it
df['MaxTimePerIssue'] = [maxTimesPerIssue[key] for key in df['IssueKey']]
df = df[df.MaxTimePerIssue == df.TimeSpent]
df = df.drop(columns=['MaxTimePerIssue'])

What I dislike about my Python code:

  1. maxTimesPerIssue appears in the middle of processing the df, disrupting the thought process (or pipeline)
  2. The creation of maxTimesPerIssue itself is kind of messy
  3. Adding the MaxTimePerIssue column to the df, only to drop it again
  4. It's definitely way less self-explanatory than the C# version, due to using lots of low-level stuff like reset_index(), list(), dict(), list comprehensions, and dropping columns

Can anybody help me clean it up?

Answer 1:

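Two groupby steps will work for your data: sum the time per (IssueKey, User), then compare each sum against the maximum for its issue:

# Total time per (IssueKey, User), as a Series with a two-level index
i = df.groupby(['IssueKey', 'User']).TimeSpent.sum()
# Per-issue maximum, repeated for every row of that issue
j = i.groupby(level=0).transform('max')

# Keep only the rows whose total equals the issue maximum
i[i == j].reset_index()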

   IssueKey  User  TimeSpent
0         1  John         30
1         1   Tom         30
2         2  John         25
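
The key step is transform('max'): unlike max(), it returns a Series with the same index as its input, holding each issue's maximum on every row of that issue, so the two Series can be compared element-wise and ties are kept. Below is a minimal self-contained sketch of the same idea, assuming the sample data from the question; the function name find_issue_owners and the variable names totals and issue_max are hypothetical and not part of the original answer.

```python
import pandas as pd


def find_issue_owners(logs):
    """Return the user(s) with the largest total TimeSpent per issue; ties are kept."""
    # Total time per (IssueKey, User); the result is a Series with a MultiIndex.
    totals = logs.groupby(['IssueKey', 'User'])['TimeSpent'].sum()
    # Broadcast each issue's maximum back onto every (IssueKey, User) row.
    issue_max = totals.groupby(level='IssueKey').transform('max')
    # Boolean mask: keep the rows whose total equals the maximum of their issue.
    return totals[totals == issue_max].reset_index()


# Sample data copied from the question.
df = pd.DataFrame([
    [1, 10, 'John'],
    [1, 20, 'John'],
    [1, 30, 'Tom'],
    [1, 10, 'Bob'],
    [2, 25, 'John'],
    [2, 15, 'Bob']], columns=['IssueKey', 'TimeSpent', 'User'])

issues_owners = find_issue_owners(df)
print(issues_owners)
```

Because the input frame is never reassigned and no helper column is added or dropped, the whole computation stays in one place, which may address the pipeline concern raised in the question.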