How to fill 0 and NaN for a dataframe after groupb

2019-08-17 00:22发布

问题:

I have a dataframe looks like this after I groupby them.

gb = new_user_log.groupby(['msno', 'date', 'num_25', 'num_50',
                          'num_75', 'num_985', 'num_100',
                          'num_unq', 'total_secs', 'days']).days.count()

Part of the dataframe shows here:

  msno                                          date        num_25  num_50  num_75  num_985  num_100  num_unq  total_secs  days
    ++orpnUqSevh2M5A97pRRiONA58g5m9DwaNrhD44HY0=  2016-08-14  78      13      3       3        77       84       18987.862   2       1
    +0krxpTkQT7hciN95OEp7i2lyKvbXft887VNQGF6xN4=  2016-12-22  27      23      5       2        11       65       5946.577    35      1
                                                                                                                             98      1
    +1eAmfPiXsMG0J+U/j7YTeXO+u0/ZhZqmoXfXKOY57I=  2016-12-17  8       2       2       2        126      131      32629.128   46      1
    +1zyLKrstPYDOnoyooqqb0+VmmovONd9N3QZynT7rq0=  2016-07-14  18      2       1       3        46       66       10253.571   196     1
                                                  2016-09-17  11      3       1       6        81       45       12970.416   261     1
                                                  2016-11-17  5       1       3       2        8        19       2614.571    322     1
    +28j7X4BShC9g2C4No4xAmzXuhLlmMbFDYWZcbblCEs=  2015-11-21  1       2       1       2        88       94       23580.548   45      1
                                                  2016-03-28  12      6       5       4        36       61       11596.887   173     1
    +2GtgQNpMuIG0Vy1ycAbfla0zRni6NKtRLlyyuHHB8I=  2015-07-18  15      1       2       2        73       75       19179.451   102     1
                                                  2016-07-29  7       2       1       1        8        17       2515.486    479     1
                                                  2016-11-09  45      6       3       1        71       93       19865.317   582     1

So the first little problem I hope to solve is give each of 'msno'(userID) a number, not as a string type...

All these features num_25 num_50 num_75 num_985 num_100 num_unq total_secs are counting events happened in each day.

The last feature days is calculated by days = date - transaction_date means the number of days a user own the product. And the firsttransaction_date for each user means the first time to buy this product. It comes from the dataframe belows,

                                                msno  payment_method_id  \
0       QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=                 39   
20      GBy8qSz16X5iYWD+3CMxv/Hm6OPSrXBYtmbnlRtknW0=                 33   
44      T0FF6lumjKcqEO0O+tUH2ytc+Kb9EkeaLzcVUiTr1aE=                 40   
72      9iW/UpqRoviya9CQh64RoYLPjiws8+biwscN+g5qYm4=                 37   
86      LUCi7i5FeNuZz4DB0zKu8J80rgr2uCO8YHCSW9PZfAk=                 38   
98      qCagaTL3UbMn0zvdMDDUoonbei70q1eASKrXa1cZGCs=                 38   
120     nsx5IGkCueevv0vFHB4jkG0HdRl6m6ltB8U9Guo5nS0=                 40   
123     nsx5IGkCueevv0vFHB4jkG0HdRl6m6ltB8U9Guo5nS0=                 40   


      transaction_date membership_expire_date  
0            2016-10-31             2016-12-21    
20           2016-03-31             2016-04-30  
44           2015-03-28             2015-04-27  
72           2015-05-04             2015-06-03   
86           2016-03-13             2016-08-24  
98           2015-03-20             2015-04-19   
120          2016-04-07             2016-05-06   
123          2017-01-01             2017-02-06   

You can see that in days there are only a few values for each user. So is it possible to fill all other days of the event with 0 to expand the dataframe to a very sparse one? The rule I have to follow is to find the maximum value in days column (In this sample is 582) then fill all 582 days with 0s for every user under those events columns when no events happened.

From the title I said fill NaN because some of the users have multiple transcation_date and membership_expire_date records(See the last two rows). For example, I want to compute the difference between the second transcation_date - the first membership_expire_date(similar to how days come) of the user then fill in NaN in this range time and also maybe the third starts - the second ends(It can be a lot of range time for some users even about 30). So two types of values should be filled, 0 and NaN.

I don't know if I made these questions clearly and indeed that was a lot.. If anyone could help me out or even give me some hints, you must be my lifesaver!