I am trying to find missing timestamps. There are a lot of solutions to this single problem; nevertheless, I also want to find "where" a timestamp is missing for each ID.
So, for example, the test dataset would look like this:
elemuid timestamp
1232 2018-02-10 23:00:00
1232 2018-02-10 23:01:00
1232 2018-02-10 22:58:00
1674 2018-02-10 22:40:00
1674 2018-02-10 22:39:00
1674 2018-02-10 22:37:00
1674 2018-02-10 22:35:00
And the solution should look like this:
elemuid timestamp
1232 2018-02-10 22:59:00
1674 2018-02-10 22:38:00
1674 2018-02-10 22:36:00
My problem is that I can only use dplyr, because I would like to use this code in sparklyr as well.
I would be really happy for your help!
For simplicity, let's assume you've already followed the instructions from your previous question and computed the minimum and maximum (min_max) epoch time in seconds. The remaining steps are quite similar to the ones we followed before:
Generate a range of values:
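A minimal sketch of this step, assuming an active Spark connection sc and that min_max holds the two epoch values in seconds (both names carried over from the previous question); sdf_seq builds a one-column Spark DataFrame named id:

    library(sparklyr)
    library(dplyr)

    # One row per minute between the minimum and maximum epoch time;
    # `sc` and `min_max` are assumed to exist already
    epoch_range <- sdf_seq(sc, from = min_max[1], to = min_max[2], by = 60L) %>%
      transmute(timestamp_epoch = id)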
Compute distinct elemuid:
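Assuming the input table is a tbl_spark called df, this is a plain distinct:

    # Unique ids present in the data
    elemuids <- df %>% distinct(elemuid)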
Now, we want to generate a reference table as the Cartesian product of the range and the unique ids:
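One way to sketch this is to drop down to the Dataset API so that Spark sees an explicit cross join (crossJoin requires Spark 2.1 or later; epoch_range and elemuids come from the steps above):

    # Explicit Cartesian product through the underlying Spark Dataset
    ref <- epoch_range %>%
      spark_dataframe() %>%
      invoke("crossJoin", spark_dataframe(elemuids)) %>%
      sdf_register()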
You might be tempted to use a more dplyr-ish method:
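For instance, a sketch with a constant dummy key on both sides, which expresses the same Cartesian product purely in dplyr verbs:

    # Cartesian product via a constant join key -- convenient but risky
    ref <- epoch_range %>%
      mutate(dummy = 1L) %>%
      inner_join(elemuids %>% mutate(dummy = 1L), by = "dummy") %>%
      select(-dummy)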
This would work fine if the size of the product is small (in that case Spark should use a broadcast join), but it will cause complete data skew otherwise, so it is not safe to use in general.
Finally, we'll outer join the data as before to fill things out:
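A sketch, assuming the timestamps in df are strings of the form yyyy-MM-dd HH:mm:ss, which is what Spark's from_unixtime produces by default:

    # Every (elemuid, minute) pair, with NULLs where the data has no row
    filled <- ref %>%
      mutate(timestamp = from_unixtime(timestamp_epoch)) %>%
      select(-timestamp_epoch) %>%
      left_join(df, by = c("elemuid", "timestamp"))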
Or, as already explained in the answer provided by akrun, use an anti join to keep only the missing points:
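Same assumptions as above; the anti join keeps exactly the (elemuid, timestamp) pairs from the reference grid that are absent from the data, i.e. the desired output:

    # Rows of the reference grid with no counterpart in df = the gaps
    missing <- ref %>%
      mutate(timestamp = from_unixtime(timestamp_epoch)) %>%
      select(-timestamp_epoch) %>%
      anti_join(df, by = c("elemuid", "timestamp"))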
Here is one option with anti_join. Assuming that the 'timestamp' column is not a Datetime object, we convert it to POSIXct, then, grouped by 'elemuid', use complete to expand the 'timestamp' in steps of one minute, and do an anti_join with the original dataset.
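A runnable sketch of that approach on the question's data (df1 is an assumed name for the input):

    library(dplyr)
    library(tidyr)

    df1 <- tibble::tribble(
      ~elemuid, ~timestamp,
      1232, "2018-02-10 23:00:00",
      1232, "2018-02-10 23:01:00",
      1232, "2018-02-10 22:58:00",
      1674, "2018-02-10 22:40:00",
      1674, "2018-02-10 22:39:00",
      1674, "2018-02-10 22:37:00",
      1674, "2018-02-10 22:35:00"
    )

    df1 %>%
      mutate(timestamp = as.POSIXct(timestamp)) %>%   # character -> Datetime
      group_by(elemuid) %>%
      # fill in every minute between each id's first and last observation
      complete(timestamp = seq(min(timestamp), max(timestamp), by = "1 min")) %>%
      ungroup() %>%
      # keep only the minutes that were absent from the original data
      anti_join(df1 %>% mutate(timestamp = as.POSIXct(timestamp)),
                by = c("elemuid", "timestamp"))

This returns 22:59:00 for 1232 and 22:36:00 plus 22:38:00 for 1674, matching the expected output above.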