Ok, I have a (big) dataframe, something like this:
date time value
0 20100201 0 1
1 20100201 6 2
2 20100201 12 3
3 20100201 18 4
4 20100202 0 5
5 20100202 6 6
6 20100202 12 7
7 20100202 18 8
8 20100203 0 9
9 20100203 18 11
10 20100204 6 12
...
8845 20160101 18 8846
As you can see, the dataframe has a column date
, a column time
with four hours for each day (00, 06, 12, 18) and a column value
.
The problem is that there are missing dates in the dataframe, in the example above there should be two extra rows between rows 8 and 9, corresponding to the hours 6
and 12
of the day 20100203
, and also an extra row between rows 9 and 10 corresponding to the hour 0
of the day 20100204
.
What would I need? I would like to iterate the date
column of the dataframe, checking that every day exists and no one is missing, and also that for every day there are the four hours (00, 06, 12, 18). In case that something is missing during the iteration there should be added in exactly that place, with the missing date
and time
and NaN
as a value. In order to not copy all the dataframe again, let me put the relevant aspects that there should appear in a final version:
...
7 20100202 18 8
8 20100203 0 9
9 20100203 6 NaN
10 20100203 12 NaN
11 20100203 18 11
12 20100204 0 NaN
13 20100204 6 12
...
In case you are interested, an easier version of this problem was asked here Modular arithmetic in python to iterate a pandas dataframe and kindly answered by users @Alexander and @piRSquared. The version asked here is a more difficult one, involving (I suppose) the use of datetime and timedelta and iterating more columns.
Sorry for the long post and thank you very much.
You can use
pivot
for reshaping - you getNaN
in missing values by columntime
, thenunstack
withreset_index
andsort_values
:Maybe you can
reset_index
again, if you need niceindex
like:Ok, thank you, it is almost almost done, there is something missing, I would need the dataframe to be ordered, i.e., for each day, beginning with 20100201, the first row for the 00 hour, the second for 06, the third for 12, the fourth for 18, then 20100202 beginning with 00 hour and so on until the final date in the year 2016... This order is important to be able to do some statistics with the data. Let me show you what I get:
(By the way, in the highly probable case of repetition in the
value
column, I suppose there is no problem, right? The solution is designed to eliminate the duplicates simultaneously in the other two columns, right?)