JavaRDD<String> history_ = sc.emptyRDD();
java.util.Queue<JavaRDD<String> > queue = new LinkedList<JavaRDD<String>>();
queue.add(history_);
JavaDStream<String> history_dstream = ssc.queueStream(queue);
JavaPairDStream<String,ArrayList<String>> history = history_dstream.mapToPair(r -> {
return new Tuple2< String,ArrayList<String> >(null,null);
});
JavaPairInputDStream<String, GenericData.Record> stream_1 =
KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class, StringDecoder.class,
GenericDataRecordDecoder.class, props, topicsSet_1);
JavaPairInputDStream<String, GenericData.Record> stream_2 =
KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class, StringDecoder.class,
GenericDataRecordDecoder.class, props, topicsSet_2);
then doing some transformation and creating twp DStream Data_1 and Data_2 of type
JavaPairDStream<String, <ArrayList<String>>
and do the join as below , then filtering out those records for whom there was no joining key and saving them in history for using it in next batch by doing its union with Data_1
Data_1 = Data_1.union(history);
JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> joined =
Data_1.leftOuterJoin(Data_2).cache();
JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> notNULL_join = joined.filter(r -> r._2._2().isPresent());
JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> dstream_filtered = joined.filter(r -> !r._2._2().isPresent());
history = dstream_filtered.mapToPair(r -> {
return new Tuple2<>(r._1,r._2._1);
}).persist;
I get history after the previous step(checked by saving it to hdfs) , but still this history is empty in batch while doing union.