I have the following DataFrame:
id  lat  long  lag_lat  lag_long  detector  lag_interval  gpsdt  lead_gpsdt
1   12   13    12       13        1         [1.5,3.5]     4      4.5
1   12   13    12       13        1         null          4.5    5
1   12   13    12       13        1         null          5      5.5
1   12   13    12       13        1         null          5.5    6
1   13   14    12       13        2         null          6      6.5
1   13   14    13       14        2         null          6.5    null
2   13   14    13       14        2         [0.5,1.5]     2.5    3.5
2   13   14    13       14        2         null          3.5    4
2   13   14    13       14        2         null          4      null
I want to apply a condition inside agg when grouping by col("id") and col("detector"). If lag_interval has any non-null value within a group, the aggregation should produce two columns:
min("lag_interval.col1") and max("lead_gpsdt")
If lag_interval is null for every row in the group, I want instead:
min("gpsdt"), max("lead_gpsdt")
This is the aggregation I have so far. It always uses min("gpsdt"), and I want to make that part conditional:
df.groupBy("detector", "id").agg(
  first(struct(col("lat"), col("long"))).as("start_coordinate"),  // no "lat-long" column exists, so pair lat and long
  last(struct(col("lat"), col("long"))).as("end_coordinate"),
  struct(min("gpsdt"), max("lead_gpsdt")).as("interval"))
Expected output:
id  interval  start_coordinate  end_coordinate
1   [1.5,6]   [12,13]           [13,14]
1   [6,6.5]   [13,14]           [13,14]
2   [0.5,4]   [13,14]           [13,14]
**For more explanation**
groupBy("id", "detector") splits the data into groups, and the check is made per group.
If at least one value of col("lag_interval") in a group is not null, the aggregation should be min(lag_interval.col1), max(lead_gpsdt). This condition applies to the following group:
id  lat  long  lag_lat  lag_long  detector  lag_interval  gpsdt  lead_gpsdt
1   12   13    12       13        1         [1.5,3.5]     4      4.5
1   12   13    12       13        1         null          4.5    5
1   12   13    12       13        1         null          5      5.5
1   12   13    12       13        1         null          5.5    6
If all values of col("lag_interval") in a group are null, the aggregation should be min("gpsdt"), max("lead_gpsdt"). This condition applies to the following group:
id  lat  long  lag_lat  lag_long  detector  lag_interval  gpsdt  lead_gpsdt
1   13   14    12       13        2         null          6      6.5
1   13   14    13       14        2         null          6.5    null
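The rule described above could probably be expressed without an explicit "does this group contain a non-null lag_interval" check. Spark's min ignores nulls, so min(lag_interval.col1) is null only when every lag_interval in the group is null, and coalesce can then fall back to min(gpsdt). The following is a minimal sketch of that idea, not a tested solution. The object name ConditionalInterval, the helper aggregate, and the struct field names start and end are made up for illustration, and it assumes lag_interval is a struct with a field col1 of the same numeric type as gpsdt.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, first, last, max, min, struct}

// Hypothetical wrapper object, named only for this sketch.
object ConditionalInterval {

  // Assumes df has the columns shown above and that lag_interval is a
  // struct whose first field is called col1 (same type as gpsdt).
  def aggregate(df: DataFrame): DataFrame =
    df.groupBy("id", "detector").agg(
      // first/last depend on the row order inside each group
      first(struct(col("lat"), col("long"))).as("start_coordinate"),
      last(struct(col("lat"), col("long"))).as("end_coordinate"),
      struct(
        // min skips nulls: the result is null only if every lag_interval
        // in the group is null, in which case coalesce uses min(gpsdt)
        coalesce(min(col("lag_interval.col1")), min(col("gpsdt"))).as("start"),
        max(col("lead_gpsdt")).as("end")
      ).as("interval")
    )
}
```

It would be called as ConditionalInterval.aggregate(df). Note that first and last are only meaningful if the rows of each group arrive in a known order, which Spark does not guarantee after a shuffle, so the coordinate columns may need an explicit ordering by gpsdt.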