I am facing a very strange issue in Spark Streaming. I am using Spark 2.0.2 on 3 nodes with 3 executors (1 receiver and 2 processors), 2 GB of memory and 1 core per executor. The batch interval is 10 seconds and each batch holds approximately 1000 records (about 150 KB).
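For context, the job is set up roughly like the sketch below; the socket source, host/port and the per-record work are placeholders, not my actual code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of the setup described above: 10 s batch interval, one receiver.
val conf = new SparkConf().setAppName("StreamingJob")
val ssc  = new StreamingContext(conf, Seconds(10))

// Receiver-based input; the receiver pins one executor core.
val lines = ssc.socketTextStream("input-host", 9999)

lines.foreachRDD { rdd =>
  // Each 10 s batch holds roughly 1000 records (~150 KB).
  rdd.foreach(println)
}

ssc.start()
ssc.awaitTermination()
```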
The batch processing time increases gradually from about 2 seconds at the start to a few minutes. For the first 40-50 hours the job runs quite well; after that, scheduling delay and processing time start shooting up.
I took a look at the GC behaviour and there is a continuous increase in the old-generation heap capacity of the driver. Could this be the reason? I monitored the heap with jstat; the old-generation capacity grew from 1161216 bytes to 1397760 bytes over a period of six hours.
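For reference, the same old-generation figures can also be read in-process through the standard memory-pool MXBeans. This is only an illustrative sketch (I actually used jstat from the shell), and the pool name varies with the collector in use:

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// In-process equivalent of the capacity/usage figures jstat reports for the
// old generation. The pool name depends on the collector
// ("PS Old Gen", "CMS Old Gen", "G1 Old Gen", "Tenured Gen", ...).
val oldGen = ManagementFactory.getMemoryPoolMXBeans.asScala
  .find(p => p.getName.contains("Old Gen") || p.getName.contains("Tenured"))

oldGen.foreach { pool =>
  val u = pool.getUsage
  println(s"${pool.getName}: used=${u.getUsed} committed=${u.getCommitted} max=${u.getMax}")
}
```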
The machine on which the driver runs has 8 physical cores. After 40-50 hours of streaming, CPU usage is at 100% on all 8 cores, the old-generation heap is full, and full GCs are very frequent.
I have also seen a JIRA issue reporting a memory leak in Spark Streaming, but it says the leak was resolved after Spark 1.5. Is this relevant here?
Edit:
I have also taken a heap dump approximately 50 hours after the application started.
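For reference, an equivalent dump can be triggered in-process through the HotSpot diagnostic MXBean; this is just an illustrative sketch (the output path is a placeholder), not necessarily how my dump was captured:

```scala
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// Triggers a heap dump of the current JVM. The second argument ("live")
// restricts the dump to reachable objects, like `jmap -dump:live,...`.
val diag = ManagementFactory.newPlatformMXBeanProxy(
  ManagementFactory.getPlatformMBeanServer,
  "com.sun.management:type=HotSpotDiagnostic",
  classOf[HotSpotDiagnosticMXBean])

diag.dumpHeap("/tmp/driver-heap.hprof", true)
```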
Why are there so many instances of scala.collection.immutable.$colon$colon (the cons cell class behind Scala's immutable List)?