My goal is to group objects based on time overlap.
Each object in my rdd
contains a start_time
and end_time
.
I'm probably going about this inefficiently but what I'm planning on doing is assigning an overlap id to each object based on if it has any time overlap with any of the other objects. I have the logic for time overlap down. Then, I hope to group by that overlap_id
.
So first,
mapped_rdd = rdd.map(assign_overlap_id)
final_rdd = mapped_rdd.reduceByKey(combine_objects)
Now this comes to my question. How can I go about writing the assign_overlap_id function?
def assign_overlap_id(x):
...
...
return (overlap_id, x)
Naive solution using Spark SQL and Data Frames:
Scala:
And the same thing using PySpark
Low level transformations with grouping by window
A little bit smarter approach is to generate candidate pairs using a window of some specified width. Here is a rather simplified solution:
Scala:
Python:
While it can be sufficient on some datasest for a production ready solution you should consider implementing some state-of-the-art algorithm like NCList.