Iterating pandas dataframe, checking values and cr

2020-01-29 02:23发布

Ok, I have a (big) dataframe, something like this:

         date       time      value
0     20100201         0         1
1     20100201         6         2
2     20100201        12         3
3     20100201        18         4
4     20100202         0         5
5     20100202         6         6
6     20100202        12         7
7     20100202        18         8
8     20100203         0         9
9     20100203        18        11
10    20100204         6        12
...
8845  20160101        18      8846  

As you can see, the dataframe has a column date, a column time with four hours for each day (00, 06, 12, 18) and a column value.

The problem is that there are missing dates in the dataframe, in the example above there should be two extra rows between rows 8 and 9, corresponding to the hours 6 and 12 of the day 20100203, and also an extra row between rows 9 and 10 corresponding to the hour 0 of the day 20100204.

What would I need? I would like to iterate the date column of the dataframe, checking that every day exists and no one is missing, and also that for every day there are the four hours (00, 06, 12, 18). In case that something is missing during the iteration there should be added in exactly that place, with the missing date and time and NaN as a value. In order to not copy all the dataframe again, let me put the relevant aspects that there should appear in a final version:

...
7     20100202        18         8
8     20100203         0         9
9     20100203         6       NaN
10    20100203        12       NaN   
11    20100203        18        11
12    20100204         0       NaN
13    20100204         6        12
...

In case you are interested, an easier version of this problem was asked here Modular arithmetic in python to iterate a pandas dataframe and kindly answered by users @Alexander and @piRSquared. The version asked here is a more difficult one, involving (I suppose) the use of datetime and timedelta and iterating more columns.

Sorry for the long post and thank you very much.

2条回答
够拽才男人
2楼-- · 2020-01-29 03:02

You can use pivot for reshaping - you get NaN in missing values by column time, then unstack with reset_index and sort_values:

import pandas as pd

df = pd.DataFrame({'date': {0: 20100201, 1: 20100201, 2: 20100201, 3: 20100201, 4: 20100202, 5: 20100202, 6: 20100202, 7: 20100202, 8: 20100203, 9: 20100203, 10: 20100204}, 
                   'time': {0: 0, 1: 6, 2: 12, 3: 18, 4: 0, 5: 6, 6: 12, 7: 18, 8: 0, 9: 18, 10: 6},
                   'value': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 11, 10: 12}})

print (df)
        date  time  value
0   20100201     0      1
1   20100201     6      2
2   20100201    12      3
3   20100201    18      4
4   20100202     0      5
5   20100202     6      6
6   20100202    12      7
7   20100202    18      8
8   20100203     0      9
9   20100203    18     11
10  20100204     6     12
print (df.pivot(index='date', columns='time', values='value')
         .unstack()
         .reset_index(name='value')
         .sort_values('date'))

    time      date  value
0      0  20100201    1.0
4      6  20100201    2.0
8     12  20100201    3.0
12    18  20100201    4.0
1      0  20100202    5.0
5      6  20100202    6.0
9     12  20100202    7.0
13    18  20100202    8.0
2      0  20100203    9.0
6      6  20100203    NaN
10    12  20100203    NaN
14    18  20100203   11.0
3      0  20100204    NaN
7      6  20100204   12.0
11    12  20100204    NaN
15    18  20100204    NaN

Maybe you can reset_index again, if you need nice index like:

print (df.pivot(index='date', columns='time', values='value')
         .unstack()
         .reset_index(name='value')
         .sort_values('date')
         .reset_index(drop=True))

    time      date  value
0      0  20100201    1.0
1      6  20100201    2.0
2     12  20100201    3.0
3     18  20100201    4.0
4      0  20100202    5.0
5      6  20100202    6.0
6     12  20100202    7.0
7     18  20100202    8.0
8      0  20100203    9.0
9      6  20100203    NaN
10    12  20100203    NaN
11    18  20100203   11.0
12     0  20100204    NaN
13     6  20100204   12.0
14    12  20100204    NaN
15    18  20100204    NaN
查看更多
3楼-- · 2020-01-29 03:28

Ok, thank you, it is almost almost done, there is something missing, I would need the dataframe to be ordered, i.e., for each day, beginning with 20100201, the first row for the 00 hour, the second for 06, the third for 12, the fourth for 18, then 20100202 beginning with 00 hour and so on until the final date in the year 2016... This order is important to be able to do some statistics with the data. Let me show you what I get:

      time      date  value
   0     0  20100201  281.0
2224     6  20100201  278.0
4448    12  20100201  285.4
6672    18  20100201  287.6
2225     6  20100202  280.6
4449    12  20100202  287.2
6673    18  20100202  287.8
   1     0  20100202  282.4
   2     0  20100203  281.6
6674    18  20100203  287.8
4450    12  20100203  285.1
2226     6  20100203  281.0
6675    18  20100204  289.4
4451    12  20100204  286.8
   3     0  20100204  284.6
2227     6  20100204  284.2
...

(By the way, in the highly probable case of repetition in the value column, I suppose there is no problem, right? The solution is designed to eliminate the duplicates simultaneously in the other two columns, right?)

查看更多
登录 后发表回答