I have a dataframe that looks like this after I group it:
gb = new_user_log.groupby(['msno', 'date', 'num_25', 'num_50',
'num_75', 'num_985', 'num_100',
'num_unq', 'total_secs', 'days']).days.count()
Part of the result is shown here:
msno date num_25 num_50 num_75 num_985 num_100 num_unq total_secs days
++orpnUqSevh2M5A97pRRiONA58g5m9DwaNrhD44HY0= 2016-08-14 78 13 3 3 77 84 18987.862 2 1
+0krxpTkQT7hciN95OEp7i2lyKvbXft887VNQGF6xN4= 2016-12-22 27 23 5 2 11 65 5946.577 35 1
98 1
+1eAmfPiXsMG0J+U/j7YTeXO+u0/ZhZqmoXfXKOY57I= 2016-12-17 8 2 2 2 126 131 32629.128 46 1
+1zyLKrstPYDOnoyooqqb0+VmmovONd9N3QZynT7rq0= 2016-07-14 18 2 1 3 46 66 10253.571 196 1
2016-09-17 11 3 1 6 81 45 12970.416 261 1
2016-11-17 5 1 3 2 8 19 2614.571 322 1
+28j7X4BShC9g2C4No4xAmzXuhLlmMbFDYWZcbblCEs= 2015-11-21 1 2 1 2 88 94 23580.548 45 1
2016-03-28 12 6 5 4 36 61 11596.887 173 1
+2GtgQNpMuIG0Vy1ycAbfla0zRni6NKtRLlyyuHHB8I= 2015-07-18 15 1 2 2 73 75 19179.451 102 1
2016-07-29 7 2 1 1 8 17 2515.486 479 1
2016-11-09 45 6 3 1 71 93 19865.317 582 1
So the first small problem I hope to solve is to give each 'msno' (user ID) a numeric code rather than keeping it as a string type.
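A minimal sketch of one way to do this, assuming the grouped Series is first reset to an ordinary dataframe (the names df, count and user_id are my own, not from the original code):

import pandas as pd

# Turn the grouped Series back into columns; 'count' holds the counted values.
df = gb.reset_index(name='count')

# pd.factorize assigns a stable integer to each distinct msno string.
df['user_id'], uniques = pd.factorize(df['msno'])
# uniques keeps the original strings, so uniques[df['user_id']] recovers msno.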
All of these features (num_25, num_50, num_75, num_985, num_100, num_unq, total_secs) count events that happened on each day.
The last feature, days, is calculated as days = date - transaction_date and gives the number of days a user has owned the product. The first transaction_date for each user marks the first time they bought the product (see the sketch after the table below). It comes from the dataframe below:
     msno                                          payment_method_id  transaction_date  membership_expire_date
0    QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=                 39        2016-10-31              2016-12-21
20   GBy8qSz16X5iYWD+3CMxv/Hm6OPSrXBYtmbnlRtknW0=                 33        2016-03-31              2016-04-30
44   T0FF6lumjKcqEO0O+tUH2ytc+Kb9EkeaLzcVUiTr1aE=                 40        2015-03-28              2015-04-27
72   9iW/UpqRoviya9CQh64RoYLPjiws8+biwscN+g5qYm4=                 37        2015-05-04              2015-06-03
86   LUCi7i5FeNuZz4DB0zKu8J80rgr2uCO8YHCSW9PZfAk=                 38        2016-03-13              2016-08-24
98   qCagaTL3UbMn0zvdMDDUoonbei70q1eASKrXa1cZGCs=                 38        2015-03-20              2015-04-19
120  nsx5IGkCueevv0vFHB4jkG0HdRl6m6ltB8U9Guo5nS0=                 40        2016-04-07              2016-05-06
123  nsx5IGkCueevv0vFHB4jkG0HdRl6m6ltB8U9Guo5nS0=                 40        2017-01-01              2017-02-06
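For reference, here is a sketch of how days could be derived from these two frames, assuming the date columns are already parsed as datetimes and calling the transactions frame trans (my own name):

# Earliest transaction_date per user = the first purchase.
first_buy = (trans.groupby('msno', as_index=False)['transaction_date']
                  .min()
                  .rename(columns={'transaction_date': 'first_buy'}))

# Attach it to the log and count the days since that first purchase.
logs = new_user_log.merge(first_buy, on='msno', how='left')
logs['days'] = (logs['date'] - logs['first_buy']).dt.days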
You can see that in days there are only a few values for each user. So is it possible to fill all the other days with 0 to expand the dataframe into a very sparse one? The rule I have to follow is to find the maximum value in the days column (582 in this sample) and then, for every user, fill all 582 days with 0s under the event columns wherever no events happened.
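Here is a hedged sketch of that expansion, assuming df is the reset dataframe from the first sketch, that (msno, days) pairs are unique, and that days runs from 0 up to the global maximum:

import pandas as pd

event_cols = ['num_25', 'num_50', 'num_75', 'num_985',
              'num_100', 'num_unq', 'total_secs']

# Build every (msno, day) combination from day 0 to the overall maximum.
max_days = df['days'].max()
full_index = pd.MultiIndex.from_product(
    [df['msno'].unique(), range(max_days + 1)],
    names=['msno', 'days'])

# Reindex onto the full grid; days with no events become rows of 0s.
expanded = (df.set_index(['msno', 'days'])[event_cols]
              .reindex(full_index, fill_value=0)
              .reset_index())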
In the title I said "fill NaN" because some users have multiple transaction_date and membership_expire_date records (see the last two rows). For example, I want to compute the difference between a user's second transaction_date and their first membership_expire_date (derived the same way days is), fill that time range with NaN, and likewise for the gap between the third start and the second end (some users can have many such gaps, even around 30). So two types of values should be filled: 0 and NaN.
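A sketch of the NaN filling under those assumptions, again with trans as the datetime-parsed transactions frame and expanded from the previous step; the gap boundaries below are my interpretation of "second transaction minus first expiry":

import numpy as np
import pandas as pd

trans = trans.sort_values(['msno', 'transaction_date'])
first_buy = trans.groupby('msno')['transaction_date'].transform('min')

# A gap runs from the previous membership_expire_date to the next
# transaction_date, both expressed as day offsets from the first purchase.
gap_start = (trans.groupby('msno')['membership_expire_date'].shift()
             - first_buy).dt.days
gap_end = (trans['transaction_date'] - first_buy).dt.days

gaps = pd.DataFrame({'msno': trans['msno'],
                     'start': gap_start,
                     'end': gap_end}).dropna()

# Overwrite every day that falls inside a gap with NaN.
for _, g in gaps.iterrows():
    in_gap = ((expanded['msno'] == g['msno'])
              & (expanded['days'] > g['start'])
              & (expanded['days'] < g['end']))
    expanded.loc[in_gap, event_cols] = np.nan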
I don't know if I have stated these questions clearly, and indeed that was a lot. If anyone could help me out or even give me some hints, you would be my lifesaver!