I have a pandas DataFrame that records online activity as "clicks" during a user session. There are as many as 50,000 unique users, and the DataFrame has around 1.5 million rows, so most users have multiple records.
The four columns are a unique user ID, the date the user registered for the service ("Registration"), the date the user used the service ("Session"), and the total number of clicks.
The organization of the dataframe is as follows:
User_ID Registration Session clicks
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2014-01-22 7
9874452 2010-12-22 2014-08-22 2
...
(There is also a default integer index beginning with 0, but one could set User_ID as the index.)
I would like to aggregate the total number of clicks per user since the Registration date. The resulting DataFrame (or pandas Series) would list User_ID and "Total_Clicks".
User_ID Total_Clicks
2349876 722
1987293 341
2234214 220
9874452 1405
...
How does one do this in pandas? Is this done with .agg()? Each User_ID needs to be summed individually.
As there are 1.5 million records, does this scale?
The first thing to do is filter out session dates that precede the registration date, then group on User_ID and sum the clicks.
For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.
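A minimal sketch of the approach described above, using the sample rows from the question (column names as given; the dates are assumed to be held as datetimes):

```python
import pandas as pd

# Sample data shaped like the question's DataFrame (values are illustrative)
df = pd.DataFrame({
    'User_ID': [2349876, 1987293, 2234214, 9874452],
    'Registration': pd.to_datetime(['2012-02-22', '2011-02-01', '2012-07-22', '2010-12-22']),
    'Session': pd.to_datetime(['2014-04-24', '2013-05-03', '2014-01-22', '2014-08-22']),
    'clicks': [2, 1, 7, 2],
})

# Keep only sessions on or after the registration date, then sum per user
total_clicks = (
    df[df['Session'] >= df['Registration']]
    .groupby('User_ID')['clicks']
    .sum()
)
print(total_clicks)
```

The boolean mask and the grouped sum are both vectorized, which is why this stays fast at 1.5 million rows; the main cost is holding the DataFrame in memory.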
IIUC you can use groupby, sum and reset_index. This works whether the first column User_ID is a regular column or the index; in the latter case, group on the index instead.
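A sketch of those variants, assuming the question's column names (the original snippets were lost in formatting):

```python
import pandas as pd

df = pd.DataFrame({
    'User_ID': [2349876, 1987293, 2234214, 9874452],
    'clicks': [2, 1, 7, 2],
})

# User_ID is an ordinary column: group, sum, then restore User_ID as a column
out = df.groupby('User_ID')['clicks'].sum().reset_index()

# If User_ID is the index: group on the index level by name
df_idx = df.set_index('User_ID')
out_idx = df_idx.groupby(level='User_ID')['clicks'].sum()

# Or group by the index values directly
out_idx2 = df_idx.groupby(df_idx.index)['clicks'].sum()
print(out)
```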
EDIT: As Alexander pointed out, you need to filter the data before the groupby if any Session dates are earlier than the Registration date for a User_ID. I changed the 3rd row of the data for a better sample:
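A sketch of what the edited answer likely showed (the row change mentioned above is simulated here by giving the 3rd row a Session date earlier than its Registration date):

```python
import pandas as pd

df = pd.DataFrame({
    'User_ID': [2349876, 1987293, 2234214, 9874452],
    'Registration': pd.to_datetime(['2012-02-22', '2011-02-01', '2012-07-22', '2010-12-22']),
    # 3rd row altered: its Session now precedes its Registration
    'Session': pd.to_datetime(['2014-04-24', '2013-05-03', '2011-01-22', '2014-08-22']),
    'clicks': [2, 1, 7, 2],
})

# Filter invalid rows first, then group and sum; name the output column
result = (
    df[df['Session'] >= df['Registration']]
    .groupby('User_ID')['clicks']
    .sum()
    .reset_index(name='Total_Clicks')
)
print(result)
```

Note that user 2234214 drops out entirely here, since its only row fails the filter.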
Suppose your DataFrame's name is df; then do the following:
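The code this answer intended is missing; a plausible completion, assuming df has the columns shown in the question:

```python
import pandas as pd

# Hypothetical df with the question's columns; the second row is an
# invalid session (before registration) to show the filter working
df = pd.DataFrame({
    'User_ID': [2349876, 2349876, 1987293],
    'Registration': pd.to_datetime(['2012-02-22', '2012-02-22', '2011-02-01']),
    'Session': pd.to_datetime(['2014-04-24', '2012-01-01', '2013-05-03']),
    'clicks': [2, 5, 1],
})

# Drop sessions that precede registration, then sum clicks per user
valid = df[df['Session'] >= df['Registration']]
total = valid.groupby('User_ID')['clicks'].sum().rename('Total_Clicks')
print(total)
```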