In Spark Streaming, should we offload the saving step to another layer? The Spark Streaming context is not available when we use the Spark Cassandra Connector with Cassandra as our database. Moreover, even if we use some other database to save our data, we need to create a connection on the worker every time we process a batch of RDDs, because connection objects are not serializable.
Is it recommended to create/close connections at workers?
It would also make our system tightly coupled to the existing database, and we may change the database in the future.
To answer your questions:
- Yes, it is absolutely fine to create/close connections at workers.
But, make sure you don't do it for each and every record. It is
recommended to do it at the partition level or at a level where
connections are created/closed for a group of records.
- You can decouple it by passing a configuration value to the job and deciding on the type of DB connection at runtime.
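
As a rough illustration of both points, here is a minimal PySpark-style sketch. It assumes a hypothetical `InMemorySink` class standing in for a real database client and a hypothetical `SINK_FACTORIES` registry keyed by a config string. Neither comes from Spark or from any connector; only `foreachRDD` and `foreachPartition` are actual Spark calls. One connection is opened per partition, on the worker, and the database type is looked up at runtime.

```python
from typing import Callable, Dict, Iterable, List


class InMemorySink:
    """Hypothetical stand-in for a real database client (not a Spark API)."""

    def __init__(self) -> None:
        self.rows: List[object] = []
        self.closed = False

    def write(self, record: object) -> None:
        self.rows.append(record)

    def close(self) -> None:
        self.closed = True


# Hypothetical registry: maps a config string to a connection factory.
# Switching databases means registering another factory, not changing the job.
SINK_FACTORIES: Dict[str, Callable[[], InMemorySink]] = {
    "memory": InMemorySink,
}


def save_partition(records: Iterable[object], db_type: str) -> int:
    """Open one connection, write every record of the partition, then close."""
    sink = SINK_FACTORIES[db_type]()  # created on the worker, never serialized
    written = 0
    try:
        for record in records:
            sink.write(record)
            written += 1
    finally:
        sink.close()  # closed once per partition, not once per record
    return written


def attach_sink(dstream, db_type: str) -> None:
    """Wire the writer into a DStream; needs a running StreamingContext."""

    def write_rdd(rdd) -> None:
        def write_part(part) -> None:
            save_partition(part, db_type)

        rdd.foreachPartition(write_part)

    dstream.foreachRDD(write_rdd)
```

Only the string `db_type` is captured by the closures shipped to the workers, so no connection object has to be serialized. In a real job the factory would typically hand out connections from a pool instead of opening a new one for every partition.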
Possible duplicate of:
Handle database connection inside spark streaming
Read this link; it should clarify some of your questions:
Design Patterns for using foreachRDD
Hope this helps!