Let's say we have issue tracker logs and we want to find out issues owners (guys who logged the most time to the issue)
- User can log time multiple times to the same issue
- If 2 users log the same time, the are both owners
So we have some sample data:
df = pd.DataFrame([
[1, 10, 'John'],
[1, 20, 'John'],
[1, 30, 'Tom'],
[1, 10, 'Bob'],
[2, 25, 'John'],
[2, 15, 'Bob']], columns = ['IssueKey','TimeSpent','User'])
As the output we want something like this:
issues_owners = pd.DataFrame([
[1, 30, 'John'],
[1, 30, 'Tom'],
[2, 25, 'John']], columns = ['IssueKey','TimeSpent','User'])
- Both John and Tom are owners of issue
1
, as they both spent 30 minutes on it. - John actually worked on issue
1
on 2 separate days - John is also the owner of the issue
2
- Bob is lazy and doesn't own any issues :)
What I came up with feels quite disgusting (I'm relatively new to Python):
df = df.groupby(['IssueKey', 'User']).sum().reset_index()
maxTimesPerIssue = df.groupby('IssueKey')['TimeSpent'].max().reset_index()
maxTimesPerIssue = dict(zip(maxTimesPerIssue['IssueKey'], maxTimesPerIssue['TimeSpent']))
df['MaxTimePerIssue'] = [maxTimesPerIssue[key] for key in df['IssueKey']]
df = df[df.MaxTimePerIssue == df.TimeSpent]
df = df.drop(columns=['MaxTimePerIssue'])
What I dislike about my Python code:
maxTimesPerIssue
appears in the middle of processing thedf
disrupting the thought process (or pipeline)- The creation of
maxTimesPerIssue
itself is kind of messy - Adding
MaxTimePerIssue
thedf
- It's definitely way less self-explanatory than the C# version, due to using lots of low level stuff like:
reset_index()
,list()
,dict()
, list comprehensions, dropping columns
Can anybody help me clean it up?