Finding cumulative features in dataframe?

2019-08-16 09:12发布

问题:

I have a datframe with around 200 features and 3000 rows. These data samples are logged in different time, basically one per month, as shown in the below example in “col101”:

   0    col1 (id)    col2.    col3   ….   col100    col101 (date)  …     col2000 (target value)
   1        001         653.    675   ….      343.3   01-02-2017.   …                1
   2        001         673.    432   ….      387.3   01-03-2017.   …            0
   3        001         679.    528   ….      401.2   01-04-2017.   …            1
   4        001         685     223   ….      503.4   01-05-2017.   …            1
   5        002         343     428   ….      432.5   01-02-2017.   …            0
   6        002         479.    421   ….      455.3   01-03-2017.   …            0
   7         …             …         …     ….          …               ….            …            .. 

Within these features some of are cumulative data so that in every month their values have been increased. For example, col2 and col100 are the cumulative features in my dataframe. So I want to add one more column for each cumulative feature, with the difference with respect to the previous month. So my desired dataframe should be something like this:

 0  col1 (id)    col2.   col2c   ….    col100     col100c  col101 (date)  …   col2000 (targeva)
 1      001         653.    653  ….    343.3       343.3    01-02-2017.   …            1
 2      001         673.    23   ….    387.3        44      01-03-2017.   …            0
 3      001         679.     6   ….    401.2        13.9    01-04-2017.   …            1
 4      001         685      6   ….    503.4       102.2    01-05-2017.   …            1
 5      002         343     343  ….    432.5       432.5    01-02-2017.   …            0
 6      002         479.    136  ….    455.3        23.2    01-03-2017.   …            0
 7       …             …         …     ….          …               ….            …            .. 

Now, I have two problems here: 1) how can I automatically recognize those cumulative features with 200 features? and how to add that extra feature (e.g., col22c and col100c) for each cumulative attribute? Does anyone know how I can handle this?

回答1:

About differentiating two columns, you can use pandas built-in diff() function. diff() calculates the difference of each element compared to the previous one. But note that because the first element doesn't have any previous element, the first element in the result of diff() would be NaN. So we use the built-in function dropna() to drop all NaN values.

But as for detecting the cumulative columns, I don't think there would be any way. You CAN find all the columns that are always increasing (monotonic), but that doesn't mean they are cumulative necessarily.

Anyway for detecting the monotonic columns, you can first get their diff().dropna() and then check if all if these values are positive:

df = some_data_frame
col_diff = df['some_column'].diff().dropna()
is_monotonic = all(col_diff > 0)

Note that if you forget the dropna(), the result of all(col_diff > 0) would always be False (because NaN is a Falsy value)